Hmmm, ok. The replication failure could lead to the scenario I
outlined, but that's a secondary issue to the update not getting to
the follower in the first place as you say.
On Tue, Nov 6, 2018 at 12:19 PM Jeremy Smith <jas2...@cornell.edu> wrote:
>
> Thanks everyone.  I added SOLR-12969.
>
>
> Erick - those sound like important questions, but I think this issue is 
> slightly different.  In this case, replication is failing even if the leader 
> never goes down.
>
> ________________________________
> From: Erick Erickson <erickerick...@gmail.com>
> Sent: Tuesday, November 6, 2018 2:52:30 PM
> To: solr-user
> Subject: Re: SolrCloud Replication Failure
>
> Kevin:
>
> Well, let's certainly raise it as a JIRA, blocker or not I'm not sure.
> I _think_ the new LIR work done in Solr 7.3 might make it possible to
> detect this condition but I'm not totally sure what to do about it.
>
> So let's say the leader gets an update while a follower is down. (one
> leader and one follower for simplicity). Now say the leader dies and
> the follower is restarted. What should happen? Should Solr refuse to
> start? Would FORCELEADER work if the user was willing to lose data?
>
> Let's move the discussion to the JIRA though.
> On Tue, Nov 6, 2018 at 10:58 AM Kevin Risden <kris...@apache.org> wrote:
> >
> > Erick Erickson - I don't have much time to chase this down. Do you think
> > this is a blocker for 7.6? It seems pretty serious.
> >
> > Jeremy - This would be a good JIRA to create - we can move the conversation
> > there to try to get the right people involved.
> >
> > Kevin Risden
> >
> >
> > On Fri, Nov 2, 2018 at 7:57 AM Jeremy Smith <jas2...@cornell.edu> wrote:
> >
> > > Hi Susheel,
> > >
> > >      Yes, it appears that under certain conditions, if a follower is down
> > > when the leader gets an update, the follower will not receive that update
> > > when it comes back (or maybe it receives the update and it's then
> > > overwritten by its own transaction logs, I'm not sure).  Furthermore, if
> > > that follower then becomes the leader, it will replicate its own out of
> > > date value back to the former leader, even though the version number is
> > > lower.
> > >
> > >
> > >    -Jeremy
> > >
> > > ________________________________
> > > From: Susheel Kumar <susheel2...@gmail.com>
> > > Sent: Thursday, November 1, 2018 2:57:00 PM
> > > To: solr-user@lucene.apache.org
> > > Subject: Re: SolrCloud Replication Failure
> > >
> > > Are we saying it has something to do with stopping and restarting replicas?
> > > Otherwise I haven't seen/heard of any issues with document updates and
> > > forwarding to replicas...
> > >
> > > Thanks,
> > > Susheel
> > >
> > > On Thu, Nov 1, 2018 at 12:58 PM Erick Erickson <erickerick...@gmail.com> wrote:
> > >
> > > > > So this seems like it absolutely needs a JIRA....
> > > > > On Thu, Nov 1, 2018 at 9:39 AM Kevin Risden <kris...@apache.org> wrote:
> > > > >
> > > > > > I pushed 3 branches that modify test.sh to test 5.5, 6.6, and 7.5
> > > > > > locally without Docker. I still see the same behavior where the latest
> > > > > > updates aren't on the replicas. I still don't know what is happening but
> > > > > > it happens without Docker :(
> > > > >
> > > > >
> > > >
> > > https://github.com/risdenk/test-solr-start-stop-replica-consistency/branches
> > > > >
> > > > > Kevin Risden
> > > > >
> > > > >
> > > > > > On Thu, Nov 1, 2018 at 11:41 AM Kevin Risden <kris...@apache.org> wrote:
> > > > >
> > > > > > > Erick - Yeah, that's a fair point. It would be interesting to see if this
> > > > > > > fails without Docker.
> > > > > >
> > > > > > Kevin Risden
> > > > > >
> > > > > >
> > > > > > > On Thu, Nov 1, 2018 at 11:06 AM Erick Erickson <erickerick...@gmail.com> wrote:
> > > > > >
> > > > > >> Kevin:
> > > > > >>
> > > > > > >> You're also using Docker, right? Docker is not "officially" supported,
> > > > > > >> although there's some movement in that direction, and if this is only
> > > > > > >> reproducible in Docker then it's a clue where to look....
> > > > > >>
> > > > > >> Erick
> > > > > > >> On Wed, Oct 31, 2018 at 7:24 PM Kevin Risden <kris...@apache.org> wrote:
> > > > > >> >
> > > > > > >> > I haven't dug into why this is happening but it definitely reproduces. I
> > > > > > >> > removed the local requirements (port mapping and such) from the gist you
> > > > > > >> > posted (very helpful). I confirmed this fails locally and on Travis CI.
> > > > > > >> >
> > > > > > >> > https://github.com/risdenk/test-solr-start-stop-replica-consistency
> > > > > > >> >
> > > > > > >> > I don't even see the first update getting applied from num 10 -> 20. After
> > > > > > >> > the first update there is no more change.
> > > > > >> >
> > > > > >> > Kevin Risden
> > > > > >> >
> > > > > >> >
> > > > > > >> > On Wed, Oct 31, 2018 at 8:26 PM Jeremy Smith <jas2...@cornell.edu> wrote:
> > > > > >> >
> > > > > >> > > Thanks Erick, this is 7.5.0.
> > > > > >> > > ________________________________
> > > > > >> > > From: Erick Erickson <erickerick...@gmail.com>
> > > > > >> > > Sent: Wednesday, October 31, 2018 8:20:18 PM
> > > > > >> > > To: solr-user
> > > > > >> > > Subject: Re: SolrCloud Replication Failure
> > > > > >> > >
> > > > > > >> > > What version of Solr? This code was pretty much rewritten in 7.3, IIRC.
> > > > > >> > >
> > > > > > >> > > On Wed, Oct 31, 2018, 10:47 Jeremy Smith <jas2...@cornell.edu> wrote:
> > > > > >> > >
> > > > > >> > > > Hi all,
> > > > > >> > > >
> > > > > > >> > > >      We are currently running a moderately large instance of standalone
> > > > > > >> > > > Solr and are preparing to switch to SolrCloud to help us scale up.  I have
> > > > > > >> > > > been running a number of tests using Docker locally and ran into an issue
> > > > > > >> > > > where replication is consistently failing.  I have pared down the test case
> > > > > > >> > > > as much as I could.  Here's a link for the docker-compose.yml (I put it in
> > > > > > >> > > > a directory called solrcloud_simple) and a script to run the test:
> > > > > > >> > > >
> > > > > > >> > > > https://gist.github.com/smithje/2056209fc4a6fb3bcc8b44d0b7df3489
> > > > > >> > > >
> > > > > >> > > >
> > > > > >> > > > Here's the basic idea behind the test:
> > > > > >> > > >
> > > > > >> > > >
> > > > > > >> > > > 1) Create a cluster with 2 nodes (solr-1 and solr-2), 1 shard, and 2
> > > > > > >> > > > replicas (each node gets a replica).  Just use the default schema, although
> > > > > > >> > > > I've also tried our schema and got the same result.
> > > > > > >> > > >
> > > > > > >> > > >
> > > > > > >> > > > 2) Shut down solr-2
> > > > > > >> > > >
> > > > > > >> > > >
> > > > > > >> > > > 3) Add 100 simple docs, just id and a field called num.
> > > > > > >> > > >
> > > > > > >> > > >
> > > > > > >> > > > 4) Start solr-2 and check that it received the documents.  It did!
> > > > > > >> > > >
> > > > > > >> > > >
> > > > > > >> > > > 5) Update a document, commit, and check that solr-2 received the update.
> > > > > > >> > > > It did!
> > > > > > >> > > >
> > > > > > >> > > >
> > > > > > >> > > > 6) Stop solr-2, update the same document, start solr-2, and make sure that
> > > > > > >> > > > it received the update.  It did!
> > > > > > >> > > >
> > > > > > >> > > >
> > > > > > >> > > > 7) Repeat step 6 with a new value.  This time solr-2 reverts back to what
> > > > > > >> > > > it had in step 5.
> > > > > >> > > >
> > > > > >> > > >
> > > > > >> > > > I believe the main issue comes from this in the logs:
> > > > > >> > > >
> > > > > >> > > >
> > > > > > >> > > > solr-2_1  | 2018-10-31 17:04:26.135 INFO
> > > > > > >> > > > (recoveryExecutor-4-thread-1-processing-n:solr-2:8082_solr
> > > > > > >> > > > x:test_shard1_replica_n2 c:test s:shard1 r:core_node4) [c:test s:shard1
> > > > > > >> > > > r:core_node4 x:test_shard1_replica_n2] o.a.s.u.PeerSync PeerSync:
> > > > > > >> > > > core=test_shard1_replica_n2 url=http://solr-2:8082/solr  Our versions are
> > > > > > >> > > > newer. ourHighThreshold=1615861330901729280
> > > > > > >> > > > otherLowThreshold=1615861314086764545 ourHighest=1615861330901729280
> > > > > > >> > > > otherHighest=1615861335081353216
> > > > > >> > > >
> > > > > > >> > > > PeerSync thinks the versions on solr-2 are newer for some reason, so it
> > > > > > >> > > > doesn't try to sync from solr-1.  In the final state, solr-2 will always
> > > > > > >> > > > have a lower version for the updated doc than solr-1.  I've tried this with
> > > > > > >> > > > different commit strategies, both auto and manual, and it doesn't seem to
> > > > > > >> > > > make any difference.
> > > > > >> > > >
> > > > > > >> > > > Is this a bug with solr, an issue with using docker, or am I just
> > > > > > >> > > > expecting too much from solr?
> > > > > >> > > >
> > > > > >> > > > Thanks for any insights you may have,
> > > > > >> > > >
> > > > > >> > > > Jeremy
> > > > > >> > > >
> > > > > >> > > >
> > > > > >> > > >
> > > > > >> > >
> > > > > >>
> > > > > >
> > > >
> > >
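
For anyone reading the PeerSync log line quoted above: the four numbers in
it can be decoded into wall-clock times, which makes the ordering easier to
see. The sketch below is only an illustration, not Solr code. It assumes
that Solr's default version clock builds a version as the epoch time in
milliseconds shifted left by 20 bits, with the low 20 bits used as a
tie-breaking counter. The names COUNTER_BITS, LOG_VERSIONS,
version_to_millis, version_to_datetime and describe are made up for this
example. It does not try to reproduce how PeerSync decides that "our
versions are newer"; it only shows what the logged values say.

```python
from datetime import datetime, timezone

# Assumption: version = (epoch milliseconds << 20) + small counter.
COUNTER_BITS = 20

# Values copied from the PeerSync log line in the original report.
LOG_VERSIONS = {
    "ourHighThreshold": 1615861330901729280,
    "otherLowThreshold": 1615861314086764545,
    "ourHighest": 1615861330901729280,
    "otherHighest": 1615861335081353216,
}


def version_to_millis(version: int) -> int:
    """Drop the assumed counter bits, leaving epoch milliseconds."""
    return version >> COUNTER_BITS


def version_to_datetime(version: int) -> datetime:
    """Convert a version to a UTC timestamp under the same assumption."""
    return datetime.fromtimestamp(version_to_millis(version) / 1000, tz=timezone.utc)


def describe(versions: dict) -> list:
    """Return one readable line per logged version, oldest first."""
    ordered = sorted(versions.items(), key=lambda item: item[1])
    return [
        f"{name:<18} {value}  {version_to_datetime(value).isoformat()}"
        for name, value in ordered
    ]


if __name__ == "__main__":
    for line in describe(LOG_VERSIONS):
        print(line)
    gap = version_to_millis(LOG_VERSIONS["otherHighest"]) - version_to_millis(
        LOG_VERSIONS["ourHighest"]
    )
    print(f"otherHighest is {gap} ms later than ourHighest")
```

Under that assumption, otherHighest decodes to a later time than
ourHighest, which matches the observation in the report that solr-2 ends up
with the lower version for the updated doc even though the log says its
versions are newer.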
