Hi, to minimize GC pauses, try G1GC and turn on the 'ParallelRefProcEnabled' JVM flag. G1GC generally works much better for heaps larger than 4 GB. Lowering 'InitiatingHeapOccupancyPercent' will also help avoid long GC pauses, at the cost of more frequent short pauses.
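For example, if you start Solr with the stock 5.x scripts, something along these lines in bin/solr.in.sh should do it. This is only a sketch -- the occupancy percentage is workload-dependent, the 35 below is purely illustrative (the G1 default is 45):

    # in bin/solr.in.sh -- illustrative values, tune for your workload
    GC_TUNE="-XX:+UseG1GC \
      -XX:+ParallelRefProcEnabled \
      -XX:InitiatingHeapOccupancyPercent=35"

With something like that in place, the GC log should show more frequent but much shorter collections instead of the multi-second pauses.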
On 3 November 2015 at 12:12, Björn Häuser <bjoernhaeu...@gmail.com> wrote:
> Hi,
>
> thank you for your answer.
>
> 1> No OOM hit, the log does not contain any hint of that. Also solr
> wasn't restarted automatically. But the gc log has some pauses which
> are longer than 15 seconds.
>
> 2> So, if we need to recover a system we need to stop ingesting data
> into it?
>
> 3> The JVMs currently use a little bit more than 1GB of heap, with a
> now changed max-heap of 3GB. Currently thinking of lowering the heap
> to 1.5 / 2 GB (following Uwe's post).
>
> Also the RES is 4.1gb and VIRT is 12.5gb. Swap is more or less not
> used (40mb of 1GB assigned swap). According to our server monitoring
> sometimes an io spike happens, but again not that much.
>
> What I am going to do:
>
> 1.) make sure that in case of failure we stop ingesting data into solrcloud
> 2.) lower the heap to 2GB
> 3.) make sure that zookeeper can fsync its write-ahead log fast enough (<1 sec)
>
> Thanks
> Björn
>
> 2015-11-03 16:27 GMT+01:00 Erick Erickson <erickerick...@gmail.com>:
> > The GC logs don't really show anything interesting, there would
> > be 15+ second GC pauses. The Zookeeper log isn't actually very
> > interesting. As far as OOM errors, I was thinking of _solr_ logs.
> >
> > As to why the cluster doesn't self-heal, a couple of things:
> >
> > 1> Once you hit an OOM, all bets are off. The JVM needs to be
> > bounced. Many installations have kill scripts that bounce the
> > JVM. So it's explainable if you have OOM errors.
> >
> > 2> The system may be _trying_ to recover, but if you're
> > still ingesting data it may get into a resource-starved
> > situation where it makes progress but never catches up.
> >
> > Again, though, this seems like very little memory for the
> > situation you describe. I suspect you're memory-starved to
> > a point where you can't really run. But that's a guess.
> >
> > When you run, how much JVM memory are you using? The admin
> > UI should show that.
> >
> > But the pattern of 8G physical memory and 6G for Java is a red
> > flag as per Uwe's blog post; you may be swapping a lot (OS
> > memory) and that may be slowing things down enough to have
> > sessions drop. Grasping at straws here, but "top" or similar
> > should tell you what the system is doing.
> >
> > Best,
> > Erick
> >
> > On Tue, Nov 3, 2015 at 12:04 AM, Björn Häuser <bjoernhaeu...@gmail.com> wrote:
> >> Hi!
> >>
> >> Thank you for your super fast answer.
> >>
> >> I can provide more data, the question is which data :-)
> >>
> >> These are the config parameters solr runs with:
> >> https://gist.github.com/bjoernhaeuser/24e7080b9ff2a8785740
> >> (taken from the admin ui)
> >>
> >> These are the log files:
> >>
> >> https://gist.github.com/bjoernhaeuser/a60c2319d71eb35e9f1b
> >>
> >> I think your first observation is correct: SolrCloud loses the
> >> connection to zookeeper, because the connection times out.
> >>
> >> But why isn't solrcloud able to recover itself?
> >>
> >> Thanks
> >> Björn
> >>
> >>
> >> 2015-11-02 22:32 GMT+01:00 Erick Erickson <erickerick...@gmail.com>:
> >>> Without more data, I'd guess one of two things:
> >>>
> >>> 1> you're seeing stop-the-world GC pauses that cause Zookeeper to
> >>> think the node is unresponsive, which puts a node into recovery and
> >>> things go bad from there.
> >>>
> >>> 2> Somewhere in your solr logs you'll see OutOfMemory errors which can
> >>> also cascade a bunch of problems.
> >>>
> >>> In general it's an anti-pattern to allocate such a large portion of
> >>> your physical memory to the JVM, see:
> >>> http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
> >>>
> >>> Best,
> >>> Erick
> >>>
> >>> On Mon, Nov 2, 2015 at 1:21 PM, Björn Häuser <bjoernhaeu...@gmail.com> wrote:
> >>>> Hey there,
> >>>>
> >>>> we are running a SolrCloud which has 4 nodes, same config. Each node
> >>>> has 8gb memory, 6GB assigned to the JVM. This is maybe too much, but
> >>>> worked for a long time.
> >>>>
> >>>> We currently run with 2 shards, 2 replicas and 11 collections. The
> >>>> complete data-dir is about 5.3 GB.
> >>>> I think we should move some JVM heap back to the OS.
> >>>>
> >>>> We are running Solr 5.2.1. As I could not see any bugs related to
> >>>> SolrCloud in the release notes for 5.3.0 and 5.3.1, we did not bother
> >>>> to upgrade first.
> >>>>
> >>>> One of our nodes (node A) reports these errors:
> >>>>
> >>>> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
> >>>> Error from server at http://10.41.199.201:9004/solr/catalogue: Invalid
> >>>> version (expected 2, but 101) or the data in not in 'javabin' format
> >>>>
> >>>> Stacktrace: https://gist.github.com/bjoernhaeuser/46ac851586a51f8ec171
> >>>>
> >>>> And shortly after (4 seconds) this happens on a *different* node (Node B):
> >>>>
> >>>> Stopping recovery for core=suggestion coreNodeName=core_node2
> >>>>
> >>>> No stacktrace for this, but it happens for all 11 collections.
> >>>>
> >>>> 6 seconds after that Node C reports these errors:
> >>>>
> >>>> org.apache.solr.common.SolrException:
> >>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
> >>>> KeeperErrorCode = Session expired for /configs/customers/params.json
> >>>>
> >>>> Stacktrace: https://gist.github.com/bjoernhaeuser/45a244dc32d74ac989f8
> >>>>
> >>>> This also happens for all 11 collections.
> >>>>
> >>>> And then different errors happen:
> >>>>
> >>>> OverseerAutoReplicaFailoverThread had an error in its thread work
> >>>> loop.:org.apache.solr.common.SolrException: Error reading cluster
> >>>> properties
> >>>>
> >>>> cancelElection did not find election node to remove
> >>>> /overseer_elect/election/6507903311068798704-10.41.199.192:9004_solr-n_0000000112
> >>>>
> >>>> At that point the cluster is broken and stops responding to most
> >>>> queries. At the same time zookeeper looks okay.
> >>>>
> >>>> The cluster cannot self-heal from that situation and we are forced to
> >>>> take manual action and restart node after node and hope that solrcloud
> >>>> eventually recovers, which sometimes takes several minutes and several
> >>>> restarts of various nodes.
> >>>>
> >>>> We can provide more logdata if needed.
> >>>>
> >>>> Is there anything where we can start digging to find the underlying
> >>>> cause of that problem?
> >>>>
> >>>> Thanks in advance
> >>>> Björn
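PS: on point 3 in Björn's list (ZooKeeper fsync speed) -- ZooKeeper itself warns when fsync-ing its transaction log takes longer than about a second, so grepping the ZooKeeper logs is a quick way to rule disk latency in or out. Roughly like this (the log path is only illustrative, adjust to your installation):

    grep "fsync-ing the write ahead log" /var/log/zookeeper/zookeeper.log

If those warnings line up with the session-expired errors, slow disks on the ZooKeeper nodes are part of the problem; if not, GC pauses alone can explain the expired sessions whenever a pause exceeds zkClientTimeout.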