I just had an incident where my entire cluster (all nodes) hit 100% CPU on 
every node at the same time and became completely unresponsive, even to 
/_cluster/health. This happened while I was using Kibana, which had been 
working fine up to that point. I was running a few simple queries (nothing 
fancy, see the log below). Even the shutdown call failed. After restarting, 
the cluster seems fine again (green state after a few minutes), but I'd like 
to prevent similar incidents in the future. This is the second time I've seen 
Elasticsearch die this way. Last time, I got some helpful suggestions to put 
limits on the field data cache, which I did, but obviously that wasn't the 
(only) problem.

I'm running a three-node cluster with Elasticsearch 1.1.1 and Java 1.7.0_55.

I'm using Logstash and Kibana. I have about three months' worth of daily 
indices (5 shards, 2 replicas each), 210 GB in total. Kibana loads quite 
slowly but runs fine once loaded.

I don't see any out-of-memory errors in the logs, but old-generation garbage 
collections are taking 5-6 seconds and reclaiming nothing (the log below shows 
the old gen pinned at 4.3gb/4.3gb), so something is clearly accumulating on 
the heap over time. Uptime of the cluster was around three weeks (since the 
last time it died).
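
In case it's relevant, this is roughly how I've been checking heap and 
fielddata usage while the cluster is still responsive (nodes stats API; I'm 
going from the 1.x docs, so treat the exact paths as my best guess):

# per-node JVM heap occupancy and GC counters
curl -s 'http://192.168.1.13:9200/_nodes/stats/jvm?pretty'

# per-node fielddata memory usage
curl -s 'http://192.168.1.13:9200/_nodes/stats/indices/fielddata?pretty'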

Here's the log just before one of the nodes died. You can see that the query 
I ran actually failed for a valid reason: the (small) index I was trying to 
query from Kibana does not have the @timestamp field.

[2014-05-16 14:28:54,880][INFO ][monitor.jvm              ] [192.168.1.13] 
[gc][old][1562124][155961] duration [6.6s], collections [1]/[12.3s], total 
[6.6s]/[9.3h], memory [4.9gb]->[4.9gb]/[4.9gb], all_pools {[young] 
[532.5mb]->[532.5mb]/[532.5mb]}{[survivor] 
[43.3mb]->[38.2mb]/[66.5mb]}{[old] [4.3gb]->[4.3gb]/[4.3gb]}
[2014-05-16 14:28:54,880][DEBUG][action.search.type       ] [192.168.1.13] 
[deals_v5][4], node[ObP5F1ZSQKWOJrAb6T1T0g], [P], s[STARTED]: Failed to 
execute [org.elasticsearch.action.search.SearchRequest@4e2d22c8] lastShard 
[true]
org.elasticsearch.search.SearchParseException: [deals_v5][4]: 
from[-1],size[-1]: Parse Failure [Failed to parse source 
[{"facets":{"0":{"date_histogram":{"field":"@timestamp","interval":"30s"},"global":true,"facet_filter":{"fquery":{"query":{"filtered":{"query":{"query_string":{"query":"type:deal"}},"filter":{"bool":{"must":[{"range":{"@timestamp":{"from":1400239478674,"to":"now"}}}]}}}}}}}},"size":0}]]
        at 
org.elasticsearch.search.SearchService.parseSource(SearchService.java:634)
        at 
org.elasticsearch.search.SearchService.createContext(SearchService.java:507)
        at 
org.elasticsearch.search.SearchService.createAndPutContext(SearchService.java:480)
        at 
org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:252)
        at 
org.elasticsearch.search.action.SearchServiceTransportAction.sendExecuteQuery(SearchServiceTransportAction.java:202)
        at 
org.elasticsearch.action.search.type.TransportSearchCountAction$AsyncAction.sendExecuteFirstPhase(TransportSearchCountAction.java:70)
        at 
org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.performFirstPhase(TransportSearchTypeAction.java:216)
        at 
org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction$4.run(TransportSearchTypeAction.java:296)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
Caused by: org.elasticsearch.search.facet.FacetPhaseExecutionException: 
Facet [0]: (key) field [@timestamp] not found
        at 
org.elasticsearch.search.facet.datehistogram.DateHistogramFacetParser.parse(DateHistogramFacetParser.java:160)
        at 
org.elasticsearch.search.facet.FacetParseElement.parse(FacetParseElement.java:93)
        at 
org.elasticsearch.search.SearchService.parseSource(SearchService.java:622)
        ... 10 more
[2014-05-16 14:28:54,935][DEBUG][action.search.type       ] [192.168.1.13] 
[deals_v5][0], node[ObP5F1ZSQKWOJrAb6T1T0g], [P], s[STARTED]: Failed to 
execute [org.elasticsearch.action.search.SearchRequest@4622b34d] lastShard 
[true]
org.elasticsearch.search.SearchParseException: [deals_v5][0]: 
from[-1],size[-1]: Parse Failure [Failed to parse source 
[{"facets":{"0":{"date_histogram":{"field":"@timestamp","interval":"30s"},"global":true,"facet_filter":{"fquery":{"query":{"filtered":{"query":{"query_string":{"query":"type:deal"}},"filter":{"bool":{"must":[{"range":{"@timestamp":{"from":1400239468981,"to":"now"}}}]}}}}}}}},"size":0}]]
        at 
org.elasticsearch.search.SearchService.parseSource(SearchService.java:634)
        at 
org.elasticsearch.search.SearchService.createContext(SearchService.java:507)
        at 
org.elasticsearch.search.SearchService.createAndPutContext(SearchService.java:480)
        at 
org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:252)
        at 
org.elasticsearch.search.action.SearchServiceTransportAction.sendExecuteQuery(SearchServiceTransportAction.java:202)
        at 
org.elasticsearch.action.search.type.TransportSearchCountAction$AsyncAction.sendExecuteFirstPhase(TransportSearchCountAction.java:70)
        at 
org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.performFirstPhase(TransportSearchTypeAction.java:216)
        at 
org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction$4.run(TransportSearchTypeAction.java:296)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
Caused by: org.elasticsearch.search.facet.FacetPhaseExecutionException: 
Facet [0]: (key) field [@timestamp] not found
        at 
org.elasticsearch.search.facet.datehistogram.DateHistogramFacetParser.parse(DateHistogramFacetParser.java:160)
        at 
org.elasticsearch.search.facet.FacetParseElement.parse(FacetParseElement.java:93)
        at 
org.elasticsearch.search.SearchService.parseSource(SearchService.java:622)
        ... 10 more
[2014-05-16 14:28:54,947][DEBUG][action.search.type       ] [192.168.1.13] 
[deals_v5][3], node[_MHmRPPoR-SzgaxT4FL8wQ], [P], s[STARTED]: Failed to 
execute [org.elasticsearch.action.search.SearchRequest@4622b34d] lastShard 
[true]
org.elasticsearch.transport.NodeDisconnectedException: 
[192.168.1.14][inet[/192.168.1.14:9300]][search/phase/query] disconnected
[2014-05-16 14:29:00,036][INFO ][monitor.jvm              ] [192.168.1.13] 
[gc][old][1562125][155962] duration [5s], collections [1]/[5.1s], total 
[5s]/[9.3h], memory [4.9gb]->[4.9gb]/[4.9gb], all_pools {[young] 
[532.5mb]->[532.5mb]/[532.5mb]}{[survivor] 
[38.2mb]->[63.9mb]/[66.5mb]}{[old] [4.3gb]->[4.3gb]/[4.3gb]}
[2014-05-16 14:29:06,531][DEBUG][action.search.type       ] [192.168.1.13] 
[deals_v5][2], node[_MHmRPPoR-SzgaxT4FL8wQ], [R], s[STARTED]: Failed to 
execute [org.elasticsearch.action.search.SearchRequest@4e2d22c8] lastShard 
[true]
org.elasticsearch.transport.NodeDisconnectedException: 
[192.168.1.14][inet[/192.168.1.14:9300]][search/phase/query] disconnected
[2014-05-16 14:30:30,836][DEBUG][action.search.type       ] [192.168.1.13] 
[deals_v5][0], node[_MHmRPPoR-SzgaxT4FL8wQ], [R], s[STARTED]: Failed to 
execute [org.elasticsearch.action.search.SearchRequest@4e2d22c8] lastShard 
[true]
org.elasticsearch.transport.NodeDisconnectedException: 
[192.168.1.14][inet[/192.168.1.14:9300]][search/phase/query] disconnected

After this last message, nothing more was logged; the node just sat there 
until I restarted the cluster. The shutdown call failed and the process had to 
be killed the hard way by the init.d script.
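
Next time, before the init.d script kills it, I'll try to capture GC stats and 
a heap dump so I can see what's actually filling the old gen (standard JDK 7 
tooling as far as I know; <pid> is the Elasticsearch process id):

# watch old-gen occupancy and full-GC counts, sampled every second
jstat -gcutil <pid> 1000

# dump live heap objects for offline analysis (e.g. in Eclipse MAT)
jmap -dump:live,format=b,file=/tmp/es-heap.hprof <pid>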

I'm using ES_HEAP_SIZE=5G and it doesn't seem to be enough. Rather than 
blindly increasing the heap, I'd like to understand how to prevent it from 
running out of memory in the first place. I've already restricted the 
fielddata cache, but obviously something else is growing unchecked; beyond 
that, I've done very little tuning. I'd like some advice on how to configure 
things so that Kibana/Logstash can't kill the cluster. I'd much prefer it to 
be merely slow rather than completely dead: slow I can detect and fix, but 
dead and unresponsive is a disaster. The cluster does little besides indexing 
a steady stream of Logstash data; I look at a few Kibana dashboards a couple 
of times per day.


Here's my config:

cluster.name: linko_elasticsearch

node.name: 192.168.1.14

# force the cluster transport to use the correct network
transport.host: _eth1:ipv4_

discovery.zen.ping.unicast.hosts: 192.168.1.13,192.168.1.14,192.168.1.15

path.data: /opt/elasticsearch-data

# prevent swapping
bootstrap.mlockall: true

indices.fielddata.cache.size: 30%
indices.fielddata.cache.expire: 1h
index.cache.filter.expire: 1h

What other settings should I be configuring here to minimize the risk of ES 
ever running out of memory?
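
One candidate I found in the 1.x docs is the fielddata circuit breaker. I 
haven't tried it yet, so the setting name and value below are my assumptions 
from reading the docs rather than something I've verified on this cluster:

# reject requests whose fielddata loading would push usage past this fraction
# of the heap (the 1.x default is reportedly 60%; unverified assumption)
indices.fielddata.breaker.limit: 40%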
