Hi,

What is the memory on each of these machines?
Also, see if there is any correlation between garbage collection and the
time this anomaly happens.
Chances are that a stop-the-world pause blocked the pings for some time,
so the cluster thought some nodes were gone.
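
A quick way to check both, assuming the cluster is reachable on port 9200
and the stats API has the usual /_nodes/stats/jvm layout (adjust HOST for
your balancer VM), is to poll node stats and print per-node heap size and
GC totals so they can be lined up with the 16:35-16:42 window. Rough
sketch, not tested against your version:

import json
import time
import urllib.request

HOST = "http://localhost:9200"  # assumption: any node or the balancer VM

def node_stats():
    # JVM section of the node stats API: heap usage plus per-collector GC totals
    with urllib.request.urlopen(HOST + "/_nodes/stats/jvm") as resp:
        return json.loads(resp.read().decode("utf-8"))

while True:
    print(time.strftime("%Y-%m-%dT%H:%M:%S"))
    for node in node_stats()["nodes"].values():
        jvm = node["jvm"]
        heap_max_mb = jvm["mem"]["heap_max_in_bytes"] // (1024 * 1024)
        heap_pct = jvm["mem"]["heap_used_percent"]
        # Iterate over whatever collectors are reported rather than
        # hard-coding "young"/"old" key names.
        gcs = ", ".join(
            "%s: %d collections / %d ms"
            % (name, c["collection_count"], c["collection_time_in_millis"])
            for name, c in jvm["gc"]["collectors"].items())
        print("  %-12s heap %d%% of %d MB  [%s]"
              % (node["name"], heap_pct, heap_max_mb, gcs))
    time.sleep(30)

If an old-gen collection time jumps by many seconds right around
16:38-16:41, that would fit. Elasticsearch also warns about long
collections under the monitor.jvm logger, so grepping the node logs for
"[gc]" in that window is another quick check, and it is worth confirming
what discovery.zen.fd.ping_timeout / ping_retries are set to on these nodes.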

Thanks
          Vineeth

On Wed, Dec 24, 2014 at 4:23 PM, Abhishek Andhavarapu <abhishek...@gmail.com> wrote:

> Hi all,
>
> We recently had a cascading cluster failure. From 16:35 to 16:42 the
> cluster went red and recovered itself. I can't seem to find any obvious
> logs around this time.
>
> The cluster has about 19 nodes: 9 physical boxes each running two instances
> of Elasticsearch, and one VM as a balancer for indexing. CPU is normal and
> memory usage is below 75%.
>
>
> <https://lh6.googleusercontent.com/-LxiBa8_BUhk/VJqaEowJpyI/AAAAAAAABVc/eiv930wrrrs/s1600/heap_outage.png>
>
> Heap during the outage
>
>
>
> <https://lh3.googleusercontent.com/-es_kSoeeK3o/VJqaKzQdEiI/AAAAAAAABVk/l4Il0byIORc/s1600/heap_stable.png>
>
> Heap once stable.
>
>
> <https://lh6.googleusercontent.com/-pZV1Js-H0Uw/VJqa79NMvYI/AAAAAAAABVs/saudhOu3Vbw/s1600/cluster_overview.png>
>
> Cluster overview
>
> *Below is the list of events that happened according to Marvel:*
>
> 2014-12-23T16:41:22.456-07:00  node_event  node_joined        [E0009-1][XX] joined
> 2014-12-23T16:41:19.439-07:00  node_event  node_left          [E0009-1][XX] left
> 2014-12-23T16:41:19.439-07:00  node_event  elected_as_master  [E0011-0][XX] became master
> 2014-12-23T16:41:04.392-07:00  node_event  node_joined        [E0007-0][XX] joined
> 2014-12-23T16:40:49.176-07:00  node_event  node_joined        [E0007-1][XX] joined
> 2014-12-23T16:40:07.781-07:00  node_event  node_left          [E0007-1][XX] left
> 2014-12-23T16:40:07.781-07:00  node_event  elected_as_master  [E0010-0][XX] became master
> 2014-12-23T16:39:51.802-07:00  node_event  node_left          [E0011-1][XX] left
> 2014-12-23T16:39:05.897-07:00  node_event  node_left          [-E0004-0][XX] left
> 2014-12-23T16:38:39.128-07:00  node_event  node_left          [E0007-1][XX] left
> 2014-12-23T16:38:39.128-07:00  node_event  elected_as_master  [XX] became master
> 2014-12-23T16:38:22.445-07:00  node_event  node_left          [E0007-1][XX] left
> 2014-12-23T16:38:19.298-07:00  node_event  node_left          [E0007-0][XX] left
> 2014-12-23T16:32:57.804-07:00  node_event  elected_as_master  [XX] became master
> 2014-12-23T16:32:57.804-07:00  node_event  node_left          [E0012-0][XX] left
>
> *All I can find are some INFO logs from when the master is elected, with
> "reason: zen-disco-master_failed":*
>
> [2014-12-23 17:32:27,668][INFO ][cluster.service          ] [E0007-1]
> master {new
> [E0007-1][M8pl6CaVTWi73pWLuOFPfQ][E0007][inet[E0007/xxx]]{rack=E0007,
> max_local_storage_nodes=2, master=true}, previous
> [E0012-0][JpBCQSK_QKWj84OTzBaOXg][E0012][inet[/xxx]]{rack=E0012,
> max_local_storage_nodes=2, master=true}}, removed
> {[E0012-0][JpBCQSK_QKWj84OTzBaOXg][E0012][inet[/xxx]]{rack=E0012,
> max_local_storage_nodes=2, master=true},}, reason: zen-disco-master_failed
> ([E0012-0][JpBCQSK_QKWj84OTzBaOXg][E0012][inet[/xxx]]{rack=E0012,
> max_local_storage_nodes=2, master=true})
>
> *I couldn't find any other errors or warnings around this time. All I can
> find are OOM errors, which were also happening before.*
>
> *I found similar logs on all the nodes just before each node left:*
>
> [2014-12-23 17:38:20,117][WARN ][index.translog           ] [E0007-1]
> [xxxx70246][10] failed to flush shard on translog threshold
>
> org.elasticsearch.index.engine.FlushFailedEngineException:
> [xxxx10170246][10] Flush failed
>
> at
> org.elasticsearch.index.engine.internal.InternalEngine.flush(InternalEngine.java:868)
>
> at
> org.elasticsearch.index.shard.service.InternalIndexShard.flush(InternalIndexShard.java:609)
>
> at
> org.elasticsearch.index.translog.TranslogService$TranslogBasedFlush$1.run(TranslogService.java:201)
>
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>
> at java.lang.Thread.run(Thread.java:745)
>
> Caused by: java.lang.IllegalStateException: this writer hit an
> OutOfMemoryError; cannot commit
>
> at
> org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:2941)
>
>  at
> org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3122)
>
> *Also, I found some transport exceptions, which are not new:*
>
> [2014-12-23 17:37:52,328][WARN ][search.action            ] [E0007-1]
> Failed to send release search context
>
> org.elasticsearch.transport.SendRequestTransportException:
> [E0012-0][inet[ALLEG-P-E0012/172.16.116.112:9300]][search/freeContext]
>
> at
> org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:220)
>
> at
> org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:190)
>
> at
> org.elasticsearch.search.action.SearchServiceTransportAction.sendFreeContext(SearchServiceTransportAction.java:125)
>
> at
> org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.releaseIrrelevantSearchContexts(TransportSearchTypeAction.java:348)
>
> at
> org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction$AsyncAction.finishHim(TransportSearchQueryThenFetchAction.java:147)
>
> at org.elasticsearch.action.search.type.TransportSea
>
> *The cluster recovered after 7 minutes and is back up and green. Can these
> errors cause the nodes to stop responding, making the cluster think a node
> is dead and elect a new master, and so forth? If not, I was wondering if I
> could get some pointers on where to look, or what might have happened.*
>
> Thanks,
>
> Abhishek
>
>
