Please, please, please do _not_ try to use core discovery to add new replicas by manually creating cores and hand-editing core.properties.
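[For context, the hand-editing in question means writing the core.properties file that core discovery reads at startup. A minimal sketch is below; the values are hypothetical, and a duplicate name, a misspelled key, or the wrong shard here is exactly the kind of mistake the Collections API prevents:

  # core.properties -- read by core discovery when the node starts
  # (values are illustrative, not taken from the thread)
  name=project
  collection=project
  shard=shard1
  coreNodeName=core_node2
]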
bq: and my deployment tools create an empty core on newly provisioned machines.

This is a really bad idea (as you have discovered). Basically, your deployment tools have to do everything right to get this to "play nice" with SolrCloud. Your core names can't conflict. You have to spell all the parameters in core.properties right. Etc. There are endless places to go wrong. And it is all done for you (and tested with unit tests) via the Collections API.

Assuming that in your scenario you started machine2 before machine1, how would Solr have any clue that machine1 would _ever_ come back up? It'll do the best it can and try to elect a leader, but there's only one machine to choose from... and it's sorely out of date.

Absolutely use the Collections API to add replicas to running SolrCloud clusters. And adding a replica via the Collections API _will_ use core discovery, in that it causes a core.properties file to be written on the node in question, populated with all the necessary parameters; the new core then initiates a sync from the (running) leader, puts itself into the query rotation automatically when the sync is done, etc. All without you
1> having to figure all this out yourself
2> taking the collection offline
(a sketch of the ADDREPLICA call appears at the end of this message)

Best,
Erick

On Tue, May 26, 2015 at 2:46 PM, Michael Roberts <mrobe...@tableau.com> wrote:
> Hi,
>
> I have a SolrCloud setup, running 4.10.3. The setup consists of several
> cores, each with a single shard, and initially each shard has a single
> replica (so, basically, one machine). I am using core discovery, and my
> deployment tools create an empty core on newly provisioned machines.
>
> The scenario I am testing is: Machine 1 is running and my application is
> writing to Solr. At some point, I stop Machine 1 and reconfigure my
> application to add Machine 2. Both machines are then started.
>
> What I would expect to happen at this point is that Machine 2 cannot
> become leader because it is behind compared to Machine 1, and Machine 2
> would then restore from Machine 1.
>
> However, looking at the logs, I am seeing Machine 2 become elected leader
> and fail the PeerSync:
>
> 2015-05-24 17:20:25.983 -0700 (,,,) coreZkRegister-1-thread-4 : INFO
> org.apache.solr.cloud.ShardLeaderElectionContext - Enough replicas found to
> continue.
> 2015-05-24 17:20:25.983 -0700 (,,,) coreZkRegister-1-thread-4 : INFO
> org.apache.solr.cloud.ShardLeaderElectionContext - I may be the new leader -
> try and sync
> 2015-05-24 17:20:25.997 -0700 (,,,) coreZkRegister-1-thread-4 : INFO
> org.apache.solr.update.PeerSync - PeerSync: core=project
> url=http://10.32.132.64:11000/solr START
> replicas=[http://jchar-1:11000/solr/project/] nUpdates=100
> 2015-05-24 17:20:25.999 -0700 (,,,) coreZkRegister-1-thread-4 : INFO
> org.apache.solr.update.PeerSync - PeerSync: core=project
> url=http://10.32.132.64:11000/solr DONE. We have no versions. sync failed.
> 2015-05-24 17:20:25.999 -0700 (,,,) coreZkRegister-1-thread-4 : INFO
> org.apache.solr.cloud.ShardLeaderElectionContext - We failed sync, but we
> have no versions - we can't sync in that case - we were active before, so
> become leader anyway
> 2015-05-24 17:20:25.999 -0700 (,,,) coreZkRegister-1-thread-4 : INFO
> org.apache.solr.cloud.ShardLeaderElectionContext - I am the new leader:
> http://10.32.132.64:11000/solr/project/ shard1
>
> What is the expected behavior here? What's the best practice for adding a
> new replica? Should I have the SolrCloud cluster running and add it via
> the Collections API, or can I continue to use core discovery?
>
> Thanks.
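[For concreteness, the Collections API call recommended above is a single HTTP request. A sketch, assuming the collection "project" and shard "shard1" from the thread; the requesting host machine1:11000 and the target node name machine2:11000_solr are placeholders to substitute with your own:

  curl 'http://machine1:11000/solr/admin/collections?action=ADDREPLICA&collection=project&shard=shard1&node=machine2:11000_solr'

On success, Solr writes the core.properties on the target node, the new core syncs from the leader, and it joins the query rotation once the sync finishes.]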