Hi, to minimize GC pauses, try G1GC and turn on the 'ParallelRefProcEnabled' JVM flag. G1GC generally works much better for heaps larger than 4 GB. Lowering 'InitiatingHeapOccupancyPercent' will also help avoid long GC pauses, at the cost of more frequent short pauses.
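For example, if you start Solr with the stock 5.x scripts, something along these lines in bin/solr.in.sh should do it. This is only a sketch -- the occupancy percentage is workload-dependent, the 35 below is purely illustrative (the G1 default is 45):

    # in bin/solr.in.sh -- illustrative values, tune for your workload
    GC_TUNE="-XX:+UseG1GC \
      -XX:+ParallelRefProcEnabled \
      -XX:InitiatingHeapOccupancyPercent=35"

With something like that in place, the GC log should show more frequent but much shorter collections instead of the multi-second pauses.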
On 3 November 2015 at 12:12, Björn Häuser <bjoernhaeu...@gmail.com> wrote:
> Hi,
>
> thank you for your answer.
>
> 1> No OOM hit, the log does not contain any hint of that. Also solr
> wasn't restarted automatically. But the gc log has some pauses which
> are longer than 15 seconds.
>
> 2> So, if we need to recover a system we need to stop ingesting data
> into it?
>
> 3> The JVMs currently use a little bit more than 1GB of heap, with a
> now changed max-heap of 3GB. Currently thinking of lowering the heap
> to 1.5 / 2 GB (following Uwe's post).
>
> Also the RES is 4.1gb and VIRT is 12.5gb. Swap is more or less not
> used (40mb of 1GB assigned swap). According to our server monitoring
> sometimes an io spike happens, but again not that much.
>
> What I am going to do:
>
> 1.) make sure that in case of failure we stop ingesting data into solrcloud
> 2.) lower the heap to 2GB
> 3.) make sure that zookeeper can fsync its write-ahead log fast enough (<1 sec)
>
> Thanks
> Björn
>
> 2015-11-03 16:27 GMT+01:00 Erick Erickson <erickerick...@gmail.com>:
> > The GC logs don't really show anything interesting, there would
> > be 15+ second GC pauses. The Zookeeper log isn't actually very
> > interesting. As far as OOM errors, I was thinking of _solr_ logs.
> >
> > As to why the cluster doesn't self-heal, a couple of things:
> >
> > 1> Once you hit an OOM, all bets are off. The JVM needs to be
> > bounced. Many installations have kill scripts that bounce the
> > JVM. So it's explainable if you have OOM errors.
> >
> > 2> The system may be _trying_ to recover, but if you're
> > still ingesting data it may get into a resource-starved
> > situation where it makes progress but never catches up.
> >
> > Again, though, this seems like very little memory for the
> > situation you describe. I suspect you're memory-starved to
> > a point where you can't really run. But that's a guess.
> >
> > When you run, how much JVM memory are you using? The admin
> > UI should show that.
> >
> > But the pattern of 8G physical memory and 6G for Java is a red
> > flag as per Uwe's blog post; you may be swapping a lot (OS
> > memory) and that may be slowing things down enough to have
> > sessions drop. Grasping at straws here, but "top" or similar
> > should tell you what the system is doing.
> >
> > Best,
> > Erick
> >
> > On Tue, Nov 3, 2015 at 12:04 AM, Björn Häuser <bjoernhaeu...@gmail.com> wrote:
> >> Hi!
> >>
> >> Thank you for your super fast answer.
> >>
> >> I can provide more data, the question is which data :-)
> >>
> >> These are the config parameters solr runs with:
> >> https://gist.github.com/bjoernhaeuser/24e7080b9ff2a8785740
> >> (taken from the admin ui)
> >>
> >> These are the log files:
> >>
> >> https://gist.github.com/bjoernhaeuser/a60c2319d71eb35e9f1b
> >>
> >> I think your first observation is correct: SolrCloud loses the
> >> connection to zookeeper, because the connection times out.
> >>
> >> But why isn't solrcloud able to recover itself?
> >>
> >> Thanks
> >> Björn
> >>
> >>
> >> 2015-11-02 22:32 GMT+01:00 Erick Erickson <erickerick...@gmail.com>:
> >>> Without more data, I'd guess one of two things:
> >>>
> >>> 1> you're seeing stop-the-world GC pauses that cause Zookeeper to
> >>> think the node is unresponsive, which puts a node into recovery and
> >>> things go bad from there.
> >>>
> >>> 2> Somewhere in your solr logs you'll see OutOfMemory errors which can
> >>> also cascade a bunch of problems.
> >>>
> >>> In general it's an anti-pattern to allocate such a large portion of
> >>> your physical memory to the JVM, see:
> >>> http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
> >>>
> >>> Best,
> >>> Erick
> >>>
> >>> On Mon, Nov 2, 2015 at 1:21 PM, Björn Häuser <bjoernhaeu...@gmail.com> wrote:
> >>>> Hey there,
> >>>>
> >>>> we are running a SolrCloud which has 4 nodes, same config. Each node
> >>>> has 8gb memory, 6GB assigned to the JVM. This is maybe too much, but
> >>>> worked for a long time.
> >>>>
> >>>> We currently run with 2 shards, 2 replicas and 11 collections. The
> >>>> complete data-dir is about 5.3 GB.
> >>>> I think we should move some JVM heap back to the OS.
> >>>>
> >>>> We are running Solr 5.2.1. As I could not see any bugs related to
> >>>> SolrCloud in the release notes for 5.3.0 and 5.3.1, we did not bother
> >>>> to upgrade first.
> >>>>
> >>>> One of our nodes (node A) reports these errors:
> >>>>
> >>>> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
> >>>> Error from server at http://10.41.199.201:9004/solr/catalogue: Invalid
> >>>> version (expected 2, but 101) or the data in not in 'javabin' format
> >>>>
> >>>> Stacktrace: https://gist.github.com/bjoernhaeuser/46ac851586a51f8ec171
> >>>>
> >>>> And shortly after (4 seconds) this happens on a *different* node (Node B):
> >>>>
> >>>> Stopping recovery for core=suggestion coreNodeName=core_node2
> >>>>
> >>>> No stacktrace for this, but it happens for all 11 collections.
> >>>>
> >>>> 6 seconds after that Node C reports these errors:
> >>>>
> >>>> org.apache.solr.common.SolrException:
> >>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
> >>>> KeeperErrorCode = Session expired for /configs/customers/params.json
> >>>>
> >>>> Stacktrace: https://gist.github.com/bjoernhaeuser/45a244dc32d74ac989f8
> >>>>
> >>>> This also happens for all 11 collections.
> >>>>
> >>>> And then different errors happen:
> >>>>
> >>>> OverseerAutoReplicaFailoverThread had an error in its thread work
> >>>> loop.:org.apache.solr.common.SolrException: Error reading cluster
> >>>> properties
> >>>>
> >>>> cancelElection did not find election node to remove
> >>>> /overseer_elect/election/6507903311068798704-10.41.199.192:9004_solr-n_0000000112
> >>>>
> >>>> At that point the cluster is broken and stops responding to most
> >>>> queries. At the same time zookeeper looks okay.
> >>>>
> >>>> The cluster cannot self-heal from that situation and we are forced to
> >>>> take manual action and restart node after node and hope that solrcloud
> >>>> eventually recovers, which sometimes takes several minutes and several
> >>>> restarts of various nodes.
> >>>>
> >>>> We can provide more logdata if needed.
> >>>>
> >>>> Is there anything where we can start digging to find the underlying
> >>>> cause of that problem?
> >>>>
> >>>> Thanks in advance
> >>>> Björn
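PS: on point 3 in Björn's list (ZooKeeper fsync speed) -- ZooKeeper itself warns when fsync-ing its transaction log takes longer than about a second, so grepping the ZooKeeper logs is a quick way to rule disk latency in or out. Roughly like this (the log path is only illustrative, adjust to your installation):

    grep "fsync-ing the write ahead log" /var/log/zookeeper/zookeeper.log

If those warnings line up with the session-expired errors, slow disks on the ZooKeeper nodes are part of the problem; if not, GC pauses alone can explain the expired sessions whenever a pause exceeds zkClientTimeout.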