node failures

Kireet Reddy Tue, 10 Jun 2014 07:42:34 -0700

On our 4-node test cluster (1.1.2), seemingly out of the blue, one node 
experienced very high CPU usage and became unresponsive. About 8 hours 
later, another node experienced the same issue. The processes themselves 
stayed alive, GC activity was normal, and they didn't hit an 
OutOfMemoryError. The nodes left the cluster, though, perhaps due to the 
unresponsiveness. The only errors in the log files were a bunch of messages 
like:


org.elasticsearch.search.SearchContextMissingException: No search context found for id ...

and errors about the search queue being full. We see the 
SearchContextMissingException occasionally during normal operation, but 
during the high-CPU period it happened much more often.

I don't think we had an unusually high number of queries during that time, 
because the other 2 nodes had normal CPU usage and things had run smoothly 
for the prior week.

We are going to restart testing, but is there anything we can do to better 
understand what happened? Maybe change a particular log level or do 
something while the problem is happening, assuming we can reproduce the 
issue?

