Hello Abhishek,

Can you try to correlate the merge operations on the shards with the time of these cascading failures? I feel there is a correlation between the two. If so, we can do some optimization on that side.
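For example, the per-node merge activity can be pulled from the nodes stats API and lined up against the 16:35-16:42 window; a rough check, assuming the default HTTP port on any of the nodes, would be something like:

  # Cumulative merge counts and merge time per node; poll this around the incident
  # window and look for nodes whose merge time jumps just before they drop out.
  curl -s 'localhost:9200/_nodes/stats/indices/merges?pretty'

  # What each node is busy with right now; large merges show up as "Lucene Merge Thread".
  curl -s 'localhost:9200/_nodes/hot_threads'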
Thanks,
Vineeth

On Thu, Dec 25, 2014 at 8:53 AM, Abhishek Andhavarapu <abhishek...@gmail.com> wrote:
> Mark,
>
> Thanks for reading. Our heap sizes are kept below 32 GB to avoid uncompressed pointers. We ideally double our cluster every year; the number of shards is planned for that future growth and for the way documents are spread across all the nodes in the cluster.
>
> Thanks,
> Abhishek
>
> On Thursday, December 25, 2014 2:05:22 AM UTC+5:30, Mark Walkom wrote:
>>
>> That's a pretty big number of shards, why is it so high?
>> The recommendation there is one shard per index per node, so you should (ideally) have closer to 6600 shards (347 indices x 19 nodes).
>>
>> On 25 December 2014 at 07:07, Pat Wright <sqla...@gmail.com> wrote:
>>
>>> Mark,
>>>
>>> I work on the cluster as well, so I can answer the size/makeup.
>>> Data: 580GB
>>> Shards: 10K
>>> Indices: 347
>>> ES version: 1.3.2
>>>
>>> Not sure of the Java version.
>>>
>>> Thanks for getting back!
>>>
>>> pat
>>>
>>> On Wednesday, December 24, 2014 12:04:03 PM UTC-7, Mark Walkom wrote:
>>>>
>>>> You should drop your heap to 31GB; over that you lose some performance and actual heap space due to uncompressed pointers.
>>>>
>>>> It looks like a node, or nodes, dropped out due to GC. How much data and how many indexes do you have? What ES and Java versions?
>>>>
>>>> On 24 December 2014 at 22:29, Abhishek <abhis...@gmail.com> wrote:
>>>>
>>>>> Thanks for reading, Vineeth. That was my initial thought, but I couldn't find any old-gen GC during the outage. Each ES node has 32 gigs of heap. Each box has 128 gigs split between 2 ES nodes (32G each) and file system cache (64G).
>>>>>
>>>>> On Wed, Dec 24, 2014 at 4:49 PM, vineeth mohan <vm.vine...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> What is the memory on each of these machines?
>>>>>> Also, see if there is any correlation between garbage collection and the time this anomaly happens.
>>>>>> Chances are that the stop-the-world pauses might block the ping for some time, and the cluster might think some nodes are gone.
>>>>>>
>>>>>> Thanks
>>>>>> Vineeth
>>>>>>
>>>>>> On Wed, Dec 24, 2014 at 4:23 PM, Abhishek Andhavarapu <abhis...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> We recently had a cascading cluster failure. From 16:35 to 16:42 the cluster went red and then recovered itself. I can't seem to find any obvious logs around this time.
>>>>>>>
>>>>>>> The cluster has about 19 nodes: 9 physical boxes running two instances of Elasticsearch each, and one VM as a balancer for indexing. CPU is normal and memory usage is below 75%.
>>>>>>>
>>>>>>> <https://lh6.googleusercontent.com/-LxiBa8_BUhk/VJqaEowJpyI/AAAAAAAABVc/eiv930wrrrs/s1600/heap_outage.png>
>>>>>>> Heap during the outage
>>>>>>>
>>>>>>> <https://lh3.googleusercontent.com/-es_kSoeeK3o/VJqaKzQdEiI/AAAAAAAABVk/l4Il0byIORc/s1600/heap_stable.png>
>>>>>>> Heap once stable.
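>>>>>>>
>>>>>>> (To line the heap graphs up against GC, the per-node GC counts and pause times are available from the nodes stats API; a minimal check, assuming the default HTTP port on one of the nodes, is something like:
>>>>>>>
>>>>>>>   # Old-gen collection count and total pause time per node (jvm.gc.collectors.old);
>>>>>>>   # sample it before and during an incident and compare.
>>>>>>>   curl -s 'localhost:9200/_nodes/stats/jvm?pretty'
>>>>>>>
>>>>>>> A long stop-the-world old-gen pause usually also shows up as [monitor.jvm] warnings in the node logs.)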
>>>>>>>
>>>>>>> <https://lh6.googleusercontent.com/-pZV1Js-H0Uw/VJqa79NMvYI/AAAAAAAABVs/saudhOu3Vbw/s1600/cluster_overview.png>
>>>>>>> Cluster overview
>>>>>>>
>>>>>>> *Below is the list of events that happened according to Marvel:*
>>>>>>>
>>>>>>> 2014-12-23T16:41:22.456-07:00  node_event  node_joined        [E0009-1][XX] joined
>>>>>>> 2014-12-23T16:41:19.439-07:00  node_event  node_left          [E0009-1][XX] left
>>>>>>> 2014-12-23T16:41:19.439-07:00  node_event  elected_as_master  [E0011-0][XX] became master
>>>>>>> 2014-12-23T16:41:04.392-07:00  node_event  node_joined        [E0007-0][XX] joined
>>>>>>> 2014-12-23T16:40:49.176-07:00  node_event  node_joined        [E0007-1][XX] joined
>>>>>>> 2014-12-23T16:40:07.781-07:00  node_event  node_left          [E0007-1][XX] left
>>>>>>> 2014-12-23T16:40:07.781-07:00  node_event  elected_as_master  [E0010-0][XX] became master
>>>>>>> 2014-12-23T16:39:51.802-07:00  node_event  node_left          [E0011-1][XX] left
>>>>>>> 2014-12-23T16:39:05.897-07:00  node_event  node_left          [-E0004-0][XX] left
>>>>>>> 2014-12-23T16:38:39.128-07:00  node_event  node_left          [E0007-1][XX] left
>>>>>>> 2014-12-23T16:38:39.128-07:00  node_event  elected_as_master  [XX] became master
>>>>>>> 2014-12-23T16:38:22.445-07:00  node_event  node_left          [E0007-1][XX] left
>>>>>>> 2014-12-23T16:38:19.298-07:00  node_event  node_left          [E0007-0][XX] left
>>>>>>> 2014-12-23T16:32:57.804-07:00  node_event  elected_as_master  [XX] became master
>>>>>>> 2014-12-23T16:32:57.804-07:00  node_event  node_left          [E0012-0][XX] left
>>>>>>>
>>>>>>> *All I can find are some INFO logs when a new master is elected, with "reason: zen-disco-master_failed":*
>>>>>>>
>>>>>>> [2014-12-23 17:32:27,668][INFO ][cluster.service          ] [E0007-1] master {new [E0007-1][M8pl6CaVTWi73pWLuOFPfQ][E0007][inet[E0007/xxx]]{rack=E0007, max_local_storage_nodes=2, master=true}, previous [E0012-0][JpBCQSK_QKWj84OTzBaOXg][E0012][inet[/xxx]]{rack=E0012, max_local_storage_nodes=2, master=true}}, removed {[E0012-0][JpBCQSK_QKWj84OTzBaOXg][E0012][inet[/xxx]]{rack=E0012, max_local_storage_nodes=2, master=true},}, reason: zen-disco-master_failed ([E0012-0][JpBCQSK_QKWj84OTzBaOXg][E0012][inet[/xxx]]{rack=E0012, max_local_storage_nodes=2, master=true})
>>>>>>>
>>>>>>> *I couldn't find any other errors or warnings around this time. All I can find are OOM errors, which were also happening before.*
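>>>>>>>
>>>>>>> (Since OOM errors are showing up, a quick per-node look at the usual heap consumers may help; a minimal check, assuming the default HTTP port, is something like:
>>>>>>>
>>>>>>>   # Per-node index memory; the fielddata and segments sections are the usual heap
>>>>>>>   # suspects when the heap keeps climbing until the index writer hits an OOM.
>>>>>>>   curl -s 'localhost:9200/_nodes/stats/indices?pretty' | grep -E '"name"|memory'
>>>>>>> )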
>>>>>>>
>>>>>>> *I found similar logs in all the nodes just before the node left:*
>>>>>>>
>>>>>>> [2014-12-23 17:38:20,117][WARN ][index.translog           ] [E0007-1] [xxxx70246][10] failed to flush shard on translog threshold
>>>>>>> org.elasticsearch.index.engine.FlushFailedEngineException: [xxxx10170246][10] Flush failed
>>>>>>>     at org.elasticsearch.index.engine.internal.InternalEngine.flush(InternalEngine.java:868)
>>>>>>>     at org.elasticsearch.index.shard.service.InternalIndexShard.flush(InternalIndexShard.java:609)
>>>>>>>     at org.elasticsearch.index.translog.TranslogService$TranslogBasedFlush$1.run(TranslogService.java:201)
>>>>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>>>>     at java.lang.Thread.run(Thread.java:745)
>>>>>>> Caused by: java.lang.IllegalStateException: this writer hit an OutOfMemoryError; cannot commit
>>>>>>>     at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:2941)
>>>>>>>     at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3122)
>>>>>>>
>>>>>>> *Also found some transport exceptions, which are not new:*
>>>>>>>
>>>>>>> [2014-12-23 17:37:52,328][WARN ][search.action            ] [E0007-1] Failed to send release search context
>>>>>>> org.elasticsearch.transport.SendRequestTransportException: [E0012-0][inet[ALLEG-P-E0012/172.16.116.112:9300]][search/freeContext]
>>>>>>>     at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:220)
>>>>>>>     at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:190)
>>>>>>>     at org.elasticsearch.search.action.SearchServiceTransportAction.sendFreeContext(SearchServiceTransportAction.java:125)
>>>>>>>     at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.releaseIrrelevantSearchContexts(TransportSearchTypeAction.java:348)
>>>>>>>     at org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction$AsyncAction.finishHim(TransportSearchQueryThenFetchAction.java:147)
>>>>>>>     at org.elasticsearch.action.search.type.TransportSea
>>>>>>>
>>>>>>> *The cluster recovered after about 7 minutes and is back up and green. Can these errors cause nodes to stop responding, making the cluster think a node is dead, elect a new master, and so forth? If not, I was wondering if I can get some pointers on where to look, or on what might have happened.*
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Abhishek
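>>>>>>>
>>>>>>> (One place to look: grep every node's log for the discovery and GC-overhead loggers around the incident window, 17:3x in the log timestamps above, to see which node stalled first; a rough sketch, assuming the default package log path:
>>>>>>>
>>>>>>>   # Pull master-failure, GC-overhead and OOM lines from the incident window on each node.
>>>>>>>   grep -E 'zen-disco|monitor\.jvm|OutOfMemoryError' /var/log/elasticsearch/*.log | grep '2014-12-23 17:3'
>>>>>>> )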