I just had an incident where my entire cluster (all nodes) ended up at 100% CPU on each node at the same time and became completely unresponsive, even to /_cluster/health. This happened while I was using Kibana, which had been working fine up to that point. I was running a few simple queries (nothing fancy, see the log below). Even the shutdown call failed. After restarting, the cluster seems fine again (green state after a few minutes), but I'd like to prevent similar incidents in the future. This is the second time I've seen Elasticsearch die this way. Last time, I got some helpful suggestions to put limits on the fielddata cache, which I did, but obviously that wasn't the (only) problem.
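For context, since the failure mode is "unresponsive even to /_cluster/health", I've been thinking an external watchdog with a hard timeout would at least make the dead-vs-slow distinction visible before a lockup. A minimal sketch of what I mean (the host and the 5-second timeout are illustrative placeholders, not anything I actually run):

```shell
#!/bin/sh
# Hypothetical watchdog sketch: fail fast if /_cluster/health doesn't answer
# within a few seconds. ES_HOST is a placeholder; any node in the cluster works.
ES_HOST="${ES_HOST:-192.168.1.13:9200}"

# -m 5: give up after 5 seconds; a hung node would otherwise block forever.
# -f: treat HTTP errors as failures so red/unreachable both trip the alert.
if ! curl -sf -m 5 "http://$ES_HOST/_cluster/health?pretty" >/dev/null; then
    echo "cluster health check failed or timed out on $ES_HOST" >&2
    exit 1
fi
```

Something like this, run from cron, would have caught both incidents well before I happened to notice Kibana hanging.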
I'm running a three-node cluster with Elasticsearch 1.1.1 on Java 1.7.0_55, using Logstash and Kibana. I have about three months' worth of indices (5 shards, 2 replicas), with a fresh index every day: 210G in total. Kibana loads quite slowly but runs fine once loaded. I don't see evidence of out-of-memory errors in the logs, but garbage collections are very slow (5s or more), so obviously something is leaking memory over time. The cluster's uptime must have been around three weeks (since the last time it died).

Here's the log from just before one of the nodes died. You can see that the query I ran actually failed for valid reasons, because the (small) index I was trying to query from Kibana does not actually have the @timestamp field.

[2014-05-16 14:28:54,880][INFO ][monitor.jvm              ] [192.168.1.13] [gc][old][1562124][155961] duration [6.6s], collections [1]/[12.3s], total [6.6s]/[9.3h], memory [4.9gb]->[4.9gb]/[4.9gb], all_pools {[young] [532.5mb]->[532.5mb]/[532.5mb]}{[survivor] [43.3mb]->[38.2mb]/[66.5mb]}{[old] [4.3gb]->[4.3gb]/[4.3gb]}
[2014-05-16 14:28:54,880][DEBUG][action.search.type       ] [192.168.1.13] [deals_v5][4], node[ObP5F1ZSQKWOJrAb6T1T0g], [P], s[STARTED]: Failed to execute [org.elasticsearch.action.search.SearchRequest@4e2d22c8] lastShard [true]
org.elasticsearch.search.SearchParseException: [deals_v5][4]: from[-1],size[-1]: Parse Failure [Failed to parse source [{"facets":{"0":{"date_histogram":{"field":"@timestamp","interval":"30s"},"global":true,"facet_filter":{"fquery":{"query":{"filtered":{"query":{"query_string":{"query":"type:deal"}},"filter":{"bool":{"must":[{"range":{"@timestamp":{"from":1400239478674,"to":"now"}}}]}}}}}}}},"size":0}]]
    at org.elasticsearch.search.SearchService.parseSource(SearchService.java:634)
    at org.elasticsearch.search.SearchService.createContext(SearchService.java:507)
    at org.elasticsearch.search.SearchService.createAndPutContext(SearchService.java:480)
    at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:252)
    at org.elasticsearch.search.action.SearchServiceTransportAction.sendExecuteQuery(SearchServiceTransportAction.java:202)
    at org.elasticsearch.action.search.type.TransportSearchCountAction$AsyncAction.sendExecuteFirstPhase(TransportSearchCountAction.java:70)
    at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.performFirstPhase(TransportSearchTypeAction.java:216)
    at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction$4.run(TransportSearchTypeAction.java:296)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
Caused by: org.elasticsearch.search.facet.FacetPhaseExecutionException: Facet [0]: (key) field [@timestamp] not found
    at org.elasticsearch.search.facet.datehistogram.DateHistogramFacetParser.parse(DateHistogramFacetParser.java:160)
    at org.elasticsearch.search.facet.FacetParseElement.parse(FacetParseElement.java:93)
    at org.elasticsearch.search.SearchService.parseSource(SearchService.java:622)
    ... 10 more
[2014-05-16 14:28:54,935][DEBUG][action.search.type       ] [192.168.1.13] [deals_v5][0], node[ObP5F1ZSQKWOJrAb6T1T0g], [P], s[STARTED]: Failed to execute [org.elasticsearch.action.search.SearchRequest@4622b34d] lastShard [true]
org.elasticsearch.search.SearchParseException: [deals_v5][0]: from[-1],size[-1]: Parse Failure [Failed to parse source [{"facets":{"0":{"date_histogram":{"field":"@timestamp","interval":"30s"},"global":true,"facet_filter":{"fquery":{"query":{"filtered":{"query":{"query_string":{"query":"type:deal"}},"filter":{"bool":{"must":[{"range":{"@timestamp":{"from":1400239468981,"to":"now"}}}]}}}}}}}},"size":0}]]
    at org.elasticsearch.search.SearchService.parseSource(SearchService.java:634)
    at org.elasticsearch.search.SearchService.createContext(SearchService.java:507)
    at org.elasticsearch.search.SearchService.createAndPutContext(SearchService.java:480)
    at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:252)
    at org.elasticsearch.search.action.SearchServiceTransportAction.sendExecuteQuery(SearchServiceTransportAction.java:202)
    at org.elasticsearch.action.search.type.TransportSearchCountAction$AsyncAction.sendExecuteFirstPhase(TransportSearchCountAction.java:70)
    at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.performFirstPhase(TransportSearchTypeAction.java:216)
    at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction$4.run(TransportSearchTypeAction.java:296)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
Caused by: org.elasticsearch.search.facet.FacetPhaseExecutionException: Facet [0]: (key) field [@timestamp] not found
    at org.elasticsearch.search.facet.datehistogram.DateHistogramFacetParser.parse(DateHistogramFacetParser.java:160)
    at org.elasticsearch.search.facet.FacetParseElement.parse(FacetParseElement.java:93)
    at org.elasticsearch.search.SearchService.parseSource(SearchService.java:622)
    ... 10 more
[2014-05-16 14:28:54,947][DEBUG][action.search.type       ] [192.168.1.13] [deals_v5][3], node[_MHmRPPoR-SzgaxT4FL8wQ], [P], s[STARTED]: Failed to execute [org.elasticsearch.action.search.SearchRequest@4622b34d] lastShard [true]
org.elasticsearch.transport.NodeDisconnectedException: [192.168.1.14][inet[/192.168.1.14:9300]][search/phase/query] disconnected
[2014-05-16 14:29:00,036][INFO ][monitor.jvm              ] [192.168.1.13] [gc][old][1562125][155962] duration [5s], collections [1]/[5.1s], total [5s]/[9.3h], memory [4.9gb]->[4.9gb]/[4.9gb], all_pools {[young] [532.5mb]->[532.5mb]/[532.5mb]}{[survivor] [38.2mb]->[63.9mb]/[66.5mb]}{[old] [4.3gb]->[4.3gb]/[4.3gb]}
[2014-05-16 14:29:06,531][DEBUG][action.search.type       ] [192.168.1.13] [deals_v5][2], node[_MHmRPPoR-SzgaxT4FL8wQ], [R], s[STARTED]: Failed to execute [org.elasticsearch.action.search.SearchRequest@4e2d22c8] lastShard [true]
org.elasticsearch.transport.NodeDisconnectedException: [192.168.1.14][inet[/192.168.1.14:9300]][search/phase/query] disconnected
[2014-05-16 14:30:30,836][DEBUG][action.search.type       ] [192.168.1.13] [deals_v5][0], node[_MHmRPPoR-SzgaxT4FL8wQ], [R], s[STARTED]: Failed to execute [org.elasticsearch.action.search.SearchRequest@4e2d22c8] lastShard [true]
org.elasticsearch.transport.NodeDisconnectedException: [192.168.1.14][inet[/192.168.1.14:9300]][search/phase/query] disconnected

After this last message, nothing happened until I restarted the cluster; nothing else got logged. The shutdown call failed, and the process was killed the hard way by the init.d script. I'm using ES_HEAP_SIZE=5G and it doesn't seem to be enough. Rather than just blindly increasing the heap size, I would like to know how I can prevent it from running out of memory in the first place.
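To figure out what is actually occupying the heap before the next incident, I assume the node stats API is the place to look, since in 1.x it breaks out fielddata and filter cache sizes per node. A rough sketch of what I plan to watch (the host is one of my nodes; the exact response field names may differ by version, so treat the greps as illustrative):

```shell
# Sketch: watch the usual heap consumers on an ES 1.x node via node stats.
# Fielddata and the filter cache are the typical suspects for unbounded growth.
curl -s 'http://192.168.1.13:9200/_nodes/stats/indices?pretty' \
  | grep -E '"(fielddata|filter_cache)"' -A 3

# Per-node heap usage and old-generation GC counters:
curl -s 'http://192.168.1.13:9200/_nodes/stats/jvm?pretty' \
  | grep -E '"(heap_used_percent|collection_count|collection_time)"'
```

If fielddata stays flat at its 30% cap but the heap keeps climbing toward 4.9gb anyway, that would confirm something other than fielddata is the leak.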
I've already restricted the fielddata cache, but obviously something else is growing unchecked. Beyond limiting the fielddata cache, I've done only minimal tuning. I'd like some advice on how to configure things so that Kibana/Logstash can't kill my cluster. I'd prefer it to be merely slow rather than completely dead: slow I can detect and fix, but dead and unresponsive is a disaster. This cluster mostly does nothing but index the steady stream of Logstash data; I occasionally look at some Kibana dashboards a couple of times per day.

Here's my config:

cluster.name: linko_elasticsearch
node.name: 192.168.1.14
# force the cluster transport to use the correct network
transport.host: _eth1:ipv4_
discovery.zen.ping.unicast.hosts: 192.168.1.13,192.168.1.14,192.168.1.15
path.data: /opt/elasticsearch-data
# prevent swapping
bootstrap.mlockall: true
indices.fielddata.cache.size: 30%
indices.fielddata.cache.expire: 1h
index.cache.filter.expire: 1h

What other settings should I be configuring here to minimize the risk of ES ever running out of memory?
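One thing I'm considering, though I haven't verified it fixes this particular failure: 1.x also has a fielddata circuit breaker that should reject a request outright if loading its fielddata would blow the heap, which sounds like exactly the "slow/failing rather than dead" behavior I want. A sketch of tightening it at runtime (the 60% value is an illustrative guess on my part, not a tested recommendation):

```shell
# Sketch: tighten the ES 1.x fielddata circuit breaker dynamically, so a single
# Kibana facet over a big time range gets rejected instead of exhausting the heap.
# The 60% limit is an illustrative placeholder, not a tuned value.
curl -s -XPUT 'http://192.168.1.13:9200/_cluster/settings' -d '{
  "persistent": {
    "indices.fielddata.breaker.limit": "60%"
  }
}'
```

If anyone can confirm whether the breaker actually covers the date_histogram facets Kibana issues, that would be very helpful.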