The GC logs don't really show anything interesting; for GC to be the
culprit here you'd expect 15+ second GC pauses in there. The Zookeeper
log isn't actually very interesting either. As far as OOM errors go, I
was thinking of the _solr_ logs.

As to why the cluster doesn't self-heal, a couple of things:

1> Once you hit an OOM, all bets are off. The JVM needs to be
bounced. Many installations have kill scripts that bounce the
JVM (see the note after these two points). So if you're hitting
OOM errors, the failure to self-heal is explainable.

2> The system may be _trying_ to recover, but if you're
still ingesting data it may get into a resource-starved
situation where it makes progress but never catches up.
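
For reference, one common way such a kill/restart script gets hooked
in is the standard HotSpot option below. The script path is purely
illustrative (not taken from your config), and %p expands to the pid
of the JVM that hit the OOM:

  -XX:OnOutOfMemoryError="/path/to/restart-solr.sh %p"

If your start scripts don't do something like this, an OOM can leave
a node half-alive instead of bouncing it, which would match the
"never recovers" symptom.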

Again, though, this seems like very little memory for the
situation you describe; I suspect you're memory-starved to
the point where you can't really run. But that's a guess.

When you run, how much JVM memory are you using? The admin
UI should show that.
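
If you'd rather script that check than eyeball the admin UI, a
minimal sketch along these lines pulls the same system/JVM info the
dashboard shows (host and port are assumptions -- your logs suggest
port 9004 -- and on a stock 5.x install the wt=json output includes a
"jvm" -> "memory" section):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class SolrJvmMemCheck {
    public static void main(String[] args) throws Exception {
        // Base URL of one Solr node; adjust host/port to your install.
        String base = args.length > 0 ? args[0] : "http://localhost:8983/solr";
        URL url = new URL(base + "/admin/info/system?wt=json");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                // Dump the raw JSON; look at the "jvm" -> "memory" block
                // for used/total/max heap.
                System.out.println(line);
            }
        }
    }
}

Running it against each node while the cluster is under load would
show whether the heap is pegged near the 6G ceiling.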

But the pattern of 8G physical memory with 6G for Java is a red
flag, as per Uwe's blog post: you may be swapping a lot (OS
memory), and that may be slowing things down enough to have
sessions drop. Grasping at straws here, but "top" or similar
should tell you what the system is doing.
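
And if reading "top" on that box is a pain, a quick-and-dirty sketch
(Linux-only, nothing Solr-specific; it just dumps /proc/meminfo) can
show whether swap is actually in use:

import java.nio.file.Files;
import java.nio.file.Paths;

public class SwapCheck {
    public static void main(String[] args) throws Exception {
        // /proc/meminfo reports MemTotal/MemFree/SwapTotal/SwapFree in kB.
        for (String line : Files.readAllLines(Paths.get("/proc/meminfo"))) {
            if (line.startsWith("Mem") || line.startsWith("Swap")) {
                System.out.println(line);
            }
        }
    }
}

If SwapTotal minus SwapFree is large while Solr is under load, that
would line up with the dropped-session theory.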

Best,
Erick

On Tue, Nov 3, 2015 at 12:04 AM, Björn Häuser <bjoernhaeu...@gmail.com> wrote:
> Hi!
>
> Thank you for your super fast answer.
>
> I can provide more data, the question is which data :-)
>
> These are the config parameters solr runs with:
> https://gist.github.com/bjoernhaeuser/24e7080b9ff2a8785740 (taken from
> the admin ui)
>
> These are the log files:
>
> https://gist.github.com/bjoernhaeuser/a60c2319d71eb35e9f1b
>
> I think your first observation is correct: SolrCloud loses the
> connection to Zookeeper because the connection times out.
>
> But why isn't SolrCloud able to recover by itself?
>
> Thanks
> Björn
>
>
> 2015-11-02 22:32 GMT+01:00 Erick Erickson <erickerick...@gmail.com>:
>> Without more data, I'd guess one of two things:
>>
>> 1> you're seeing stop-the-world GC pauses that cause Zookeeper to
>> think the node is unresponsive, which puts a node into recovery and
>> things go bad from there.
>>
>> 2> Somewhere in your solr logs you'll see OutOfMemory errors, which can
>> also cascade into a bunch of other problems.
>>
>> In general it's an anti-pattern to allocate such a large portion of
>> your physical memory to the JVM, see:
>> http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
>>
>>
>>
>> Best,
>> Erick
>>
>>
>>
>> On Mon, Nov 2, 2015 at 1:21 PM, Björn Häuser <bjoernhaeu...@gmail.com> wrote:
>>> Hey there,
>>>
>>> we are running a SolrCloud cluster with 4 nodes, all with the same
>>> config. Each node has 8GB memory, 6GB of which is assigned to the JVM.
>>> This is maybe too much, but it worked for a long time.
>>>
>>> We currently run with 2 shards, 2 replicas and 11 collections. The
>>> complete data-dir is about 5.3 GB.
>>> I think we should move some JVM heap back to the OS.
>>>
>>> We are running Solr 5.2.1. As I could not see any bugs related to
>>> SolrCloud in the release notes for 5.3.0 and 5.3.1, we did not bother
>>> to upgrade first.
>>>
>>> One of our nodes (node A) reports these errors:
>>>
>>> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
>>> Error from server at http://10.41.199.201:9004/solr/catalogue: Invalid
>>> version (expected 2, but 101) or the data in not in 'javabin' format
>>>
>>> Stacktrace: https://gist.github.com/bjoernhaeuser/46ac851586a51f8ec171
>>>
>>> And shortly after (4 seconds) this happens on a *different* node (Node B):
>>>
>>> Stopping recovery for core=suggestion coreNodeName=core_node2
>>>
>>> No stacktrace for this, but it happens for all 11 collections.
>>>
>>> 6 seconds after that Node C reports these errors:
>>>
>>> org.apache.solr.common.SolrException:
>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
>>> KeeperErrorCode = Session expired for /configs/customers/params.json
>>>
>>> Stacktrace: https://gist.github.com/bjoernhaeuser/45a244dc32d74ac989f8
>>>
>>> This also happens for 11 collections.
>>>
>>> And then different errors happen:
>>>
>>> OverseerAutoReplicaFailoverThread had an error in its thread work
>>> loop.:org.apache.solr.common.SolrException: Error reading cluster
>>> properties
>>>
>>> cancelElection did not find election node to remove
>>> /overseer_elect/election/6507903311068798704-10.41.199.192:9004_solr-n_0000000112
>>>
>>> At that point the cluster is broken and stops responding to most
>>> queries. At the same time, Zookeeper looks okay.
>>>
>>> The cluster cannot self-heal from that situation, and we are forced to
>>> take manual action and restart node after node, hoping that SolrCloud
>>> eventually recovers, which sometimes takes several minutes and several
>>> restarts of various nodes.
>>>
>>> We can provide more logdata if needed.
>>>
>>> Is there anywhere we can start digging to find the underlying
>>> cause of this problem?
>>>
>>> Thanks in advance
>>> Björn
