Hello Abhishek,

Can you try to correlate the merge operations on the shards with the time of these cascading failures? I feel there is a correlation between the two. If so, we can do some optimization on that side.
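For example, the per-node merge activity can be pulled from the nodes stats API and lined up against the 16:35-16:42 window; a rough check, assuming the default HTTP port on any of the nodes, would be something like:

  # Cumulative merge counts and merge time per node; poll this around the incident
  # window and look for nodes whose merge time jumps just before they drop out.
  curl -s 'localhost:9200/_nodes/stats/indices/merges?pretty'

  # What each node is busy with right now; large merges show up as "Lucene Merge Thread".
  curl -s 'localhost:9200/_nodes/hot_threads'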
Thanks,
Vineeth

On Thu, Dec 25, 2014 at 8:53 AM, Abhishek Andhavarapu <abhishek...@gmail.com> wrote:
> Mark,
>
> Thanks for reading. Our heap sizes are kept below 32 GB to avoid uncompressed pointers. We ideally double our cluster every year; the number of shards is planned for that future growth and for the way documents are spread across all the nodes in the cluster.
>
> Thanks,
> Abhishek
>
> On Thursday, December 25, 2014 2:05:22 AM UTC+5:30, Mark Walkom wrote:
>>
>> That's a pretty big number of shards, why is it so high?
>> The recommendation there is one shard per index per node, so you should (ideally) have closer to 6600 shards (347 indices x 19 nodes).
>>
>> On 25 December 2014 at 07:07, Pat Wright <sqla...@gmail.com> wrote:
>>
>>> Mark,
>>>
>>> I work on the cluster as well, so I can answer the size/makeup.
>>> Data: 580GB
>>> Shards: 10K
>>> Indices: 347
>>> ES version: 1.3.2
>>>
>>> Not sure of the Java version.
>>>
>>> Thanks for getting back!
>>>
>>> pat
>>>
>>> On Wednesday, December 24, 2014 12:04:03 PM UTC-7, Mark Walkom wrote:
>>>>
>>>> You should drop your heap to 31GB; over that you lose some performance and actual heap space due to uncompressed pointers.
>>>>
>>>> It looks like a node, or nodes, dropped out due to GC. How much data and how many indexes do you have? What ES and Java versions?
>>>>
>>>> On 24 December 2014 at 22:29, Abhishek <abhis...@gmail.com> wrote:
>>>>
>>>>> Thanks for reading, Vineeth. That was my initial thought, but I couldn't find any old-gen GC during the outage. Each ES node has 32 gigs of heap. Each box has 128 gigs split between 2 ES nodes (32G each) and file system cache (64G).
>>>>>
>>>>> On Wed, Dec 24, 2014 at 4:49 PM, vineeth mohan <vm.vine...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> What is the memory on each of these machines?
>>>>>> Also, see if there is any correlation between garbage collection and the time this anomaly happens.
>>>>>> Chances are that the stop-the-world pauses might block the ping for some time, and the cluster might think some nodes are gone.
>>>>>>
>>>>>> Thanks
>>>>>> Vineeth
>>>>>>
>>>>>> On Wed, Dec 24, 2014 at 4:23 PM, Abhishek Andhavarapu <abhis...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> We recently had a cascading cluster failure. From 16:35 to 16:42 the cluster went red and then recovered itself. I can't seem to find any obvious logs around this time.
>>>>>>>
>>>>>>> The cluster has about 19 nodes: 9 physical boxes running two instances of Elasticsearch each, and one VM as a balancer for indexing. CPU is normal and memory usage is below 75%.
>>>>>>>
>>>>>>> <https://lh6.googleusercontent.com/-LxiBa8_BUhk/VJqaEowJpyI/AAAAAAAABVc/eiv930wrrrs/s1600/heap_outage.png>
>>>>>>> Heap during the outage
>>>>>>>
>>>>>>> <https://lh3.googleusercontent.com/-es_kSoeeK3o/VJqaKzQdEiI/AAAAAAAABVk/l4Il0byIORc/s1600/heap_stable.png>
>>>>>>> Heap once stable.
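>>>>>>>
>>>>>>> (To line the heap graphs up against GC, the per-node GC counts and pause times are available from the nodes stats API; a minimal check, assuming the default HTTP port on one of the nodes, is something like:
>>>>>>>
>>>>>>>   # Old-gen collection count and total pause time per node (jvm.gc.collectors.old);
>>>>>>>   # sample it before and during an incident and compare.
>>>>>>>   curl -s 'localhost:9200/_nodes/stats/jvm?pretty'
>>>>>>>
>>>>>>> A long stop-the-world old-gen pause usually also shows up as [monitor.jvm] warnings in the node logs.)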
>>>>>>>
>>>>>>> <https://lh6.googleusercontent.com/-pZV1Js-H0Uw/VJqa79NMvYI/AAAAAAAABVs/saudhOu3Vbw/s1600/cluster_overview.png>
>>>>>>> Cluster overview
>>>>>>>
>>>>>>> *Below is the list of events that happened according to Marvel:*
>>>>>>>
>>>>>>> 2014-12-23T16:41:22.456-07:00  node_event  node_joined        [E0009-1][XX] joined
>>>>>>> 2014-12-23T16:41:19.439-07:00  node_event  node_left          [E0009-1][XX] left
>>>>>>> 2014-12-23T16:41:19.439-07:00  node_event  elected_as_master  [E0011-0][XX] became master
>>>>>>> 2014-12-23T16:41:04.392-07:00  node_event  node_joined        [E0007-0][XX] joined
>>>>>>> 2014-12-23T16:40:49.176-07:00  node_event  node_joined        [E0007-1][XX] joined
>>>>>>> 2014-12-23T16:40:07.781-07:00  node_event  node_left          [E0007-1][XX] left
>>>>>>> 2014-12-23T16:40:07.781-07:00  node_event  elected_as_master  [E0010-0][XX] became master
>>>>>>> 2014-12-23T16:39:51.802-07:00  node_event  node_left          [E0011-1][XX] left
>>>>>>> 2014-12-23T16:39:05.897-07:00  node_event  node_left          [-E0004-0][XX] left
>>>>>>> 2014-12-23T16:38:39.128-07:00  node_event  node_left          [E0007-1][XX] left
>>>>>>> 2014-12-23T16:38:39.128-07:00  node_event  elected_as_master  [XX] became master
>>>>>>> 2014-12-23T16:38:22.445-07:00  node_event  node_left          [E0007-1][XX] left
>>>>>>> 2014-12-23T16:38:19.298-07:00  node_event  node_left          [E0007-0][XX] left
>>>>>>> 2014-12-23T16:32:57.804-07:00  node_event  elected_as_master  [XX] became master
>>>>>>> 2014-12-23T16:32:57.804-07:00  node_event  node_left          [E0012-0][XX] left
>>>>>>>
>>>>>>> *All I can find are some INFO logs when a new master is elected, with "reason: zen-disco-master_failed":*
>>>>>>>
>>>>>>> [2014-12-23 17:32:27,668][INFO ][cluster.service          ] [E0007-1] master {new [E0007-1][M8pl6CaVTWi73pWLuOFPfQ][E0007][inet[E0007/xxx]]{rack=E0007, max_local_storage_nodes=2, master=true}, previous [E0012-0][JpBCQSK_QKWj84OTzBaOXg][E0012][inet[/xxx]]{rack=E0012, max_local_storage_nodes=2, master=true}}, removed {[E0012-0][JpBCQSK_QKWj84OTzBaOXg][E0012][inet[/xxx]]{rack=E0012, max_local_storage_nodes=2, master=true},}, reason: zen-disco-master_failed ([E0012-0][JpBCQSK_QKWj84OTzBaOXg][E0012][inet[/xxx]]{rack=E0012, max_local_storage_nodes=2, master=true})
>>>>>>>
>>>>>>> *I couldn't find any other errors or warnings around this time. All I can find are OOM errors, which were also happening before.*
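>>>>>>>
>>>>>>> (Since OOM errors are showing up, a quick per-node look at the usual heap consumers may help; a minimal check, assuming the default HTTP port, is something like:
>>>>>>>
>>>>>>>   # Per-node index memory; the fielddata and segments sections are the usual heap
>>>>>>>   # suspects when the heap keeps climbing until the index writer hits an OOM.
>>>>>>>   curl -s 'localhost:9200/_nodes/stats/indices?pretty' | grep -E '"name"|memory'
>>>>>>> )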
>>>>>>>
>>>>>>> *I found similar logs in all the nodes just before the node left:*
>>>>>>>
>>>>>>> [2014-12-23 17:38:20,117][WARN ][index.translog           ] [E0007-1] [xxxx70246][10] failed to flush shard on translog threshold
>>>>>>> org.elasticsearch.index.engine.FlushFailedEngineException: [xxxx10170246][10] Flush failed
>>>>>>>     at org.elasticsearch.index.engine.internal.InternalEngine.flush(InternalEngine.java:868)
>>>>>>>     at org.elasticsearch.index.shard.service.InternalIndexShard.flush(InternalIndexShard.java:609)
>>>>>>>     at org.elasticsearch.index.translog.TranslogService$TranslogBasedFlush$1.run(TranslogService.java:201)
>>>>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>>>>     at java.lang.Thread.run(Thread.java:745)
>>>>>>> Caused by: java.lang.IllegalStateException: this writer hit an OutOfMemoryError; cannot commit
>>>>>>>     at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:2941)
>>>>>>>     at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3122)
>>>>>>>
>>>>>>> *Also found some transport exceptions, which are not new:*
>>>>>>>
>>>>>>> [2014-12-23 17:37:52,328][WARN ][search.action            ] [E0007-1] Failed to send release search context
>>>>>>> org.elasticsearch.transport.SendRequestTransportException: [E0012-0][inet[ALLEG-P-E0012/172.16.116.112:9300]][search/freeContext]
>>>>>>>     at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:220)
>>>>>>>     at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:190)
>>>>>>>     at org.elasticsearch.search.action.SearchServiceTransportAction.sendFreeContext(SearchServiceTransportAction.java:125)
>>>>>>>     at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.releaseIrrelevantSearchContexts(TransportSearchTypeAction.java:348)
>>>>>>>     at org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction$AsyncAction.finishHim(TransportSearchQueryThenFetchAction.java:147)
>>>>>>>     at org.elasticsearch.action.search.type.TransportSea
>>>>>>>
>>>>>>> *The cluster recovered after about 7 minutes and is back up and green. Can these errors cause nodes to stop responding, making the cluster think a node is dead, elect a new master, and so forth? If not, I was wondering if I can get some pointers on where to look, or on what might have happened.*
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Abhishek
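>>>>>>>
>>>>>>> (One place to look: grep every node's log for the discovery and GC-overhead loggers around the incident window, 17:3x in the log timestamps above, to see which node stalled first; a rough sketch, assuming the default package log path:
>>>>>>>
>>>>>>>   # Pull master-failure, GC-overhead and OOM lines from the incident window on each node.
>>>>>>>   grep -E 'zen-disco|monitor\.jvm|OutOfMemoryError' /var/log/elasticsearch/*.log | grep '2014-12-23 17:3'
>>>>>>> )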