Thanks, Mark. Yes, I keep track of the overseer and restart it last.
The only thing I observe is that as the ZooKeeper cluster state file
grows, this behavior gets worse. I notice the following issues:

   1. Two nodes (different replicas of the same shard) get stuck in the
   recovering state without either becoming leader. I thought ZK was meant
   to break ties, but it doesn't help.
   2. If recovery fails on a replica, it gets stuck retrying for a very
   long time (on the order of tens of minutes) before it finally gives
   up or recovers.
   3. There have been cases where 1000 collections restart successfully
   but it takes over 2 hours (because of #2).
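
To make issue #1 concrete, here is a rough sketch of how one might scan a
copy of the cluster state for shards where every replica is recovering and
none is marked leader. It assumes the JSON has already been loaded into a
dict shaped like collection -> "shards" -> shard -> "replicas" -> replica,
with "state" and "leader" keys on each replica. The function name and the
sample data are hypothetical, not from a real cluster.

```python
def find_stuck_shards(cluster_state):
    """Return sorted (collection, shard) pairs with no leader and every replica recovering."""
    stuck = []
    for coll_name, coll in cluster_state.items():
        for shard_name, shard in coll.get("shards", {}).items():
            replicas = list(shard.get("replicas", {}).values())
            # Assumption: a leader replica carries "leader": "true".
            has_leader = any(r.get("leader") == "true" for r in replicas)
            all_recovering = bool(replicas) and all(
                r.get("state") == "recovering" for r in replicas
            )
            if all_recovering and not has_leader:
                stuck.append((coll_name, shard_name))
    return sorted(stuck)


# Hypothetical sample data for illustration only.
sample_state = {
    "coll_a": {
        "shards": {
            "shard1": {
                "replicas": {
                    "core_node1": {"state": "recovering"},
                    "core_node2": {"state": "recovering"},
                }
            }
        }
    },
    "coll_b": {
        "shards": {
            "shard1": {
                "replicas": {
                    "core_node1": {"state": "active", "leader": "true"},
                    "core_node2": {"state": "recovering"},
                }
            }
        }
    },
}

print(find_stuck_shards(sample_state))
```

Running something like this periodically during a rolling restart would at
least show which shards are in the tie described in #1.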

The cluster state JSON file is continuously updated as the cluster
restarts (to update core status). Has anyone seen this become a big
bottleneck? Does ZooKeeper locking files for writes cause a major issue
while restarting Solr?

Also, a side question: why do we need a global cluster state JSON?
Would it be better to break it down into a per-collection state JSON file?
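
To illustrate what I mean by per-collection state, here is a minimal
sketch. It is only a toy model of the idea, not how Solr stores anything:
split_by_collection and the sample data are hypothetical. The point is that
when one replica changes status, only that collection's document needs to
be rewritten instead of the whole global file.

```python
import json


def split_by_collection(cluster_state):
    """Serialize each collection's state as its own JSON document."""
    return {
        name: json.dumps({name: coll}, sort_keys=True)
        for name, coll in cluster_state.items()
    }


# Hypothetical two-collection state for illustration only.
state = {
    "coll_a": {"shards": {"shard1": {"replicas": {"core_node1": {"state": "recovering"}}}}},
    "coll_b": {"shards": {"shard1": {"replicas": {"core_node1": {"state": "active"}}}}},
}

before = split_by_collection(state)

# One replica in coll_a finishes recovery.
state["coll_a"]["shards"]["shard1"]["replicas"]["core_node1"]["state"] = "active"
after = split_by_collection(state)

# Only the documents whose serialized content changed would need a write.
changed = sorted(name for name in after if after[name] != before[name])
print(changed)
```

With 1000 collections, a single status update would then touch one small
document rather than one very large one.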

Thanks for all your help!
Nitin




On Wed, Aug 13, 2014 at 9:15 AM, Mark Miller <markrmil...@gmail.com> wrote:

> That is good testing :) We should track down what is up with that 30%.
> Might open a JIRA with some logs.
>
> It can help if you restart the overseer node last.
>
> There are likely some improvements around this post 4.6.
>
> --
> Mark Miller
> about.me/markrmiller
>
> On August 13, 2014 at 12:05:27 PM, KNitin (nitin.t...@gmail.com) wrote:
> > Thank you all! Yes, I want to disable it for testing purposes.
> >
> > The main issue is that a rolling restart of SolrCloud for 1000
> > collections is extremely unreliable and slow. More than 30% of the
> > collections fail to recover.
> >
> > What are some good guidelines to follow while restarting a massive
> > cluster like this?
> >
> > Are there any new improvements (post 4.6) in Solr that help restarts
> > be more robust?
> >
> > Thanks
> >
> > On Sunday, August 10, 2014, rulinma wrote:
> >
> > > good.
> > >
> > >
> > >
> > > --
> > > View this message in context:
> > >
> > > http://lucene.472066.n3.nabble.com/Disabling-transaction-logs-tp4151721p4152222.html
> > > Sent from the Solr - User mailing list archive at Nabble.com.
> > >
> >
>
>
