Re: best practice for restarting the entire SolrCloud cluster

Bill Au Thu, 08 Nov 2012 13:17:22 -0800

My replicas are actually on different machines, so they do come up.  The
problem I found is that since they can't get the leader, they come up but
are not part of the cluster.  I can still do a local search with
distrib=false.  They do not retry getting the leader, so I have to restart
them after the leader has started in order to get them back into the
cluster.
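
A minimal sketch of that start order in Python, for illustration only: start
one instance, wait until a leader can be read, then start the rest.  The names
staggered_restart, start_node, and leader_available are hypothetical, not Solr
or ZooKeeper APIs; how a node is started and how the leader is checked are left
to the caller.  The 240-second default is an assumption (the 3-minute
leaderVoteWait plus some margin).

```python
import time
from typing import Callable, Sequence


def staggered_restart(
    nodes: Sequence[str],
    start_node: Callable[[str], None],
    leader_available: Callable[[], bool],
    timeout_s: float = 240.0,  # assumption: leaderVoteWait (3 min) plus margin
    poll_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> None:
    """Start the first node, wait for a leader, then start the others.

    start_node and leader_available are caller-supplied (hypothetical) hooks:
    the first launches one Solr instance, the second reports whether a leader
    can currently be read for the shard.
    """
    if not nodes:
        return

    first, rest = nodes[0], nodes[1:]
    start_node(first)

    # Poll until a leader is visible, giving up once the deadline passes.
    deadline = clock() + timeout_s
    while not leader_available():
        if clock() >= deadline:
            raise TimeoutError(f"no leader after {timeout_s} s; started only {first}")
        sleep(poll_s)

    # A leader exists now, so the remaining nodes can find it when they start.
    for node in rest:
        start_node(node)
```

The hooks are passed in so the same loop works whether the instances are
started over ssh, by an init script, or by something else.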


Bill


On Thu, Nov 8, 2012 at 4:02 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:

> Hi - I think you're seeing:
> https://issues.apache.org/jira/browse/SOLR-3993
>
>
> -----Original message-----
> > From:Bill Au <bill.w...@gmail.com>
> > Sent: Thu 08-Nov-2012 21:16
> > To: solr-user@lucene.apache.org
> > Subject: best practice for restarting the entire SolrCloud cluster
> >
> > I have a simple SolrCloud cluster with 4 Solr instances and 1 shard.  I
> > can start and stop individual Solr instances without any problem, but
> > not when I have to shut down all the Solr instances at the same time.
> >
> > After shutting down all the Solr instances, the first instance that
> > starts up waits for all the replicas:
> >
> > INFO: Waiting until we see more replicas up: total=4 found=3
> > timeoutin=169243
> >
> > In the meantime, any additional Solr instances that start up while the
> > first one is waiting can't get the leader from zookeeper:
> >
> > SEVERE: Error getting leader from zk
> > org.apache.solr.common.SolrException: Could not get leader props
> >
> > When the first Solr instance sees all the replicas, it becomes the leader:
> >
> > INFO: Enough replicas found to continue.
> > INFO: I may be the new leader - try and sync
> >
> > But it fails to sync with the instances that had failed to get the leader
> > before:
> >
> > WARNING: PeerSync: core=collection1 url=http://host2:8983/solr exception
> > talking to http://host2:8983/solr/collection1/, failed
> > org.apache.solr.client.solrj.SolrServerException: Timeout occured while
> > waiting response from server at: http://host2:8983/solr/collection1
> >
> > So I ended up with one or more replicas down after the restart.  I had
> > to figure out which replicas were down and restart them.
> >
> > What I also discovered is that if I start the first Solr instance and
> > wait until it returns after the leaderVoteWait of 3 minutes, the rest of
> > the Solr instances can be started without any problem, since by then they
> > can get the leader from zookeeper.
> >
> > Is there a better way to restart an entire SolrCloud cluster?
> >
> > Bill
> >
>
