Hi all, I'm hoping someone can help me piece together the log entries, stack traces, and exceptions below. I have a 3-node development cluster in EC2, and two of the nodes had issues. I'm running ES 1.4.4 on servers dedicated to ES, each with 32GB RAM and a 16GB heap. My index rate averages about 10k/sec. There were no searches going on at the time of the incident.

It appears to me that node 10.0.0.12 began timing out requests to 10.0.0.45, indicating that 10.0.0.45 was having issues. Then at 04:36, 10.0.0.12 logs the ERROR about an uncaught exception ("this IndexWriter is closed"), caused by an OOME. Then at 04:43, 10.0.0.45 hits the "Create failed" WARN and logs an OOME. After that, things are basically down and unresponsive.

What is weird to me is that if 10.0.0.45 was the node having issues, why did 10.0.0.12 log an exception 7 minutes before that? Did both nodes run out of memory? Or is one of the exceptions actually saying, "I see that this other node hit an OOME, and I'm telling you about it"?

I have a few values tweaked in the elasticsearch.yml file to try to keep this from happening (configured from Puppet):

    'indices.breaker.fielddata.limit' => '20%',
    'indices.breaker.total.limit' => '25%',
    'indices.breaker.request.limit' => '10%',
    'index.merge.scheduler.type' => 'concurrent',
    'index.merge.scheduler.max_thread_count' => '1',
    'index.merge.policy.type' => 'tiered',
    'index.merge.policy.max_merged_segment' => '1gb',
    'index.merge.policy.segments_per_tier' => '4',
    'index.merge.policy.max_merge_at_once' => '4',
    'index.merge.policy.max_merge_at_once_explicit' => '4',
    'indices.memory.index_buffer_size' => '10%',
    'indices.store.throttle.type' => 'none',
    'index.translog.flush_threshold_size' => '1GB',

I have done a fair bit of reading on this and have tried about everything I can think of. :( Can anyone tell me what caused this scenario, and what can be done to avoid it?

Thank you so much for taking the time to read this.
Chris

=====
*On server 10.0.0.12:*

[2015-03-04 03:56:12,548][WARN ][transport ] [elasticsearch-ip-10-0-0-12] Received response for a request that has timed out, sent [20456ms] ago, timed out [5392ms] ago, action [cluster:monitor/nodes/stats[n]], node [[elasticsearch-ip-10-0-0-45][i4gmsxs0Q0eyvPWjajNV5A][ip-10-0-0-45.us-west-2.compute.internal][inet[ip-10-0-0-45.us-west-2.compute.internal/10.0.0.45:9300]]{master=true}], id [70061596]
[2015-03-04 04:06:02,407][INFO ][index.engine.internal ] [elasticsearch-ip-10-0-0-12] [derbysoft-ihg-20150304][2] now throttling indexing: numMergesInFlight=4, maxNumMerges=3
[2015-03-04 04:06:04,141][INFO ][index.engine.internal ] [elasticsearch-ip-10-0-0-12] [derbysoft-ihg-20150304][2] stop throttling indexing: numMergesInFlight=2, maxNumMerges=3
[2015-03-04 04:12:26,194][WARN ][transport ] [elasticsearch-ip-10-0-0-12] Received response for a request that has timed out, sent [15709ms] ago, timed out [708ms] ago, action [cluster:monitor/nodes/stats[n]], node [[elasticsearch-ip-10-0-0-45][i4gmsxs0Q0eyvPWjajNV5A][ip-10-0-0-45.us-west-2.compute.internal][inet[ip-10-0-0-45.us-west-2.compute.internal/10.0.0.45:9300]]{master=true}], id [70098828]
[2015-03-04 04:23:40,778][WARN ][transport ] [elasticsearch-ip-10-0-0-12] Received response for a request that has timed out, sent [21030ms] ago, timed out [6030ms] ago, action [cluster:monitor/nodes/stats[n]], node [[elasticsearch-ip-10-0-0-45][i4gmsxs0Q0eyvPWjajNV5A][ip-10-0-0-45.us-west-2.compute.internal][inet[ip-10-0-0-45.us-west-2.compute.internal/10.0.0.45:9300]]{master=true}], id [70124234]
[2015-03-04 04:24:47,023][WARN ][transport ] [elasticsearch-ip-10-0-0-12] Received response for a request that has timed out, sent [27275ms] ago, timed out [12275ms] ago, action [cluster:monitor/nodes/stats[n]], node [[elasticsearch-ip-10-0-0-45][i4gmsxs0Q0eyvPWjajNV5A][ip-10-0-0-45.us-west-2.compute.internal][inet[ip-10-0-0-45.us-west-2.compute.internal/10.0.0.45:9300]]{master=true}], id [70126273]
[2015-03-04 04:25:39,180][WARN ][transport ] [elasticsearch-ip-10-0-0-12] Received response for a request that has timed out, sent [19431ms] ago, timed out [4431ms] ago, action [cluster:monitor/nodes/stats[n]], node [[elasticsearch-ip-10-0-0-45][i4gmsxs0Q0eyvPWjajNV5A][ip-10-0-0-45.us-west-2.compute.internal][inet[ip-10-0-0-45.us-west-2.compute.internal/10.0.0.45:9300]]{master=true}], id [70127835]
[2015-03-04 04:26:40,775][WARN ][transport ] [elasticsearch-ip-10-0-0-12] Received response for a request that has timed out, sent [19241ms] ago, timed out [4241ms] ago, action [cluster:monitor/nodes/stats[n]], node [[elasticsearch-ip-10-0-0-45][i4gmsxs0Q0eyvPWjajNV5A][ip-10-0-0-45.us-west-2.compute.internal][inet[ip-10-0-0-45.us-west-2.compute.internal/10.0.0.45:9300]]{master=true}], id [70129981]
[2015-03-04 04:27:14,329][WARN ][transport ] [elasticsearch-ip-10-0-0-12] Received response for a request that has timed out, sent [22676ms] ago, timed out [6688ms] ago, action [cluster:monitor/nodes/stats[n]], node [[elasticsearch-ip-10-0-0-45][i4gmsxs0Q0eyvPWjajNV5A][ip-10-0-0-45.us-west-2.compute.internal][inet[ip-10-0-0-45.us-west-2.compute.internal/10.0.0.45:9300]]{master=true}], id [70130668]
[2015-03-04 04:28:15,695][WARN ][transport ] [elasticsearch-ip-10-0-0-12] Received response for a request that has timed out, sent [24042ms] ago, timed out [9041ms] ago, action [cluster:monitor/nodes/stats[n]], node [[elasticsearch-ip-10-0-0-45][i4gmsxs0Q0eyvPWjajNV5A][ip-10-0-0-45.us-west-2.compute.internal][inet[ip-10-0-0-45.us-west-2.compute.internal/10.0.0.45:9300]]{master=true}], id [70132644]
[2015-03-04 04:29:38,102][WARN ][transport ] [elasticsearch-ip-10-0-0-12] Received response for a request that has timed out, sent [16448ms] ago, timed out [1448ms] ago, action [cluster:monitor/nodes/stats[n]], node [[elasticsearch-ip-10-0-0-45][i4gmsxs0Q0eyvPWjajNV5A][ip-10-0-0-45.us-west-2.compute.internal][inet[ip-10-0-0-45.us-west-2.compute.internal/10.0.0.45:9300]]{master=true}], id [70135333]
[2015-03-04 04:33:42,393][WARN ][transport ] [elasticsearch-ip-10-0-0-12] Received response for a request that has timed out, sent [20738ms] ago, timed out [5737ms] ago, action [cluster:monitor/nodes/stats[n]], node [[elasticsearch-ip-10-0-0-45][i4gmsxs0Q0eyvPWjajNV5A][ip-10-0-0-45.us-west-2.compute.internal][inet[ip-10-0-0-45.us-west-2.compute.internal/10.0.0.45:9300]]{master=true}], id [70142427]
[2015-03-04 04:36:08,788][ERROR][marvel.agent ] [elasticsearch-ip-10-0-0-12] Background thread had an uncaught exception:
org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
    at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:698)
    at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:712)
    at org.apache.lucene.index.IndexWriter.ramBytesUsed(IndexWriter.java:462)
    at org.elasticsearch.index.engine.internal.InternalEngine.segmentsStats(InternalEngine.java:1224)
    at org.elasticsearch.index.shard.service.InternalIndexShard.segmentStats(InternalIndexShard.java:555)
    at org.elasticsearch.action.admin.indices.stats.CommonStats.<init>(CommonStats.java:170)
    at org.elasticsearch.action.admin.indices.stats.ShardStats.<init>(ShardStats.java:49)
    at org.elasticsearch.indices.InternalIndicesService.stats(InternalIndicesService.java:212)
    at org.elasticsearch.indices.InternalIndicesService.stats(InternalIndicesService.java:172)
    at org.elasticsearch.node.service.NodeService.stats(NodeService.java:138)
    at org.elasticsearch.marvel.agent.AgentService$ExportingWorker.exportNodeStats(AgentService.java:300)
    at org.elasticsearch.marvel.agent.AgentService$ExportingWorker.run(AgentService.java:225)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.OutOfMemoryError: Java heap space

=====
*On server 10.0.0.45:*

[2015-03-04 04:43:27,245][WARN ][index.engine.internal ] [elasticsearch-ip-10-0-0-45] [myindex-20150304][1] failed engine [indices:data/write/bulk[s] failed on replica]
org.elasticsearch.index.engine.CreateFailedEngineException: [myindex-20150304][1] Create failed for [my_type#AUvjGHoiku-fZf277h_4]
    at org.elasticsearch.index.engine.internal.InternalEngine.create(InternalEngine.java:421)
    at org.elasticsearch.index.shard.service.InternalIndexShard.create(InternalIndexShard.java:403)
    at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnReplica(TransportShardBulkAction.java:595)
    at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$ReplicaOperationTransportHandler.messageReceived(TransportShardReplicationOperationAction.java:246)
    at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$ReplicaOperationTransportHandler.messageReceived(TransportShardReplicationOperationAction.java:225)
    at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
    at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:698)
    at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:712)
    at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1507)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1246)
    at org.elasticsearch.index.engine.internal.InternalEngine.innerCreateNoLock(InternalEngine.java:502)
    at org.elasticsearch.index.engine.internal.InternalEngine.innerCreate(InternalEngine.java:444)
    at org.elasticsearch.index.engine.internal.InternalEngine.create(InternalEngine.java:413)
    ... 8 more
Caused by: java.lang.OutOfMemoryError: Java heap space

=====
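
For anyone who wants to line the two nodes' logs up against each other, here is a rough Python sketch of how the entries could be merged into one timeline. It is only an illustration, not a diagnosis: the function names (`parse_entries`, `merge_timeline`) and the `implied_timeout_ms` field are made up for this example, the regular expressions assume the entry header layout seen in the excerpts above, and it assumes that each entry starts on its own line in the actual log files. The sample strings reuse three entries from the excerpts with the message tails shortened. Subtracting "timed out" from "sent" gives the timeout that seems to have been in effect, which comes out at roughly 15 to 16 seconds for the entries shown.

```python
import re
from datetime import datetime

# Assumed header layout, taken from the excerpts in this post:
# [timestamp][LEVEL][component] [node name] message
ENTRY_RE = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\]"
    r"\[(?P<level>[A-Z]+)\s*\]"
    r"\[(?P<component>[^\]]+?)\s*\]\s*"
    r"\[(?P<node>[^\]]+)\]\s*"
    r"(?P<message>.*)"
)
TIMEOUT_RE = re.compile(r"sent \[(\d+)ms\] ago, timed out \[(\d+)ms\] ago")


def parse_entries(log_text):
    """Return one dict per entry header; stack-trace lines are skipped."""
    entries = []
    for line in log_text.splitlines():
        match = ENTRY_RE.match(line.strip())
        if match is None:
            continue  # continuation line ("at ...", "Caused by: ...")
        entry = match.groupdict()
        entry["ts"] = datetime.strptime(entry["ts"], "%Y-%m-%d %H:%M:%S,%f")
        timeout = TIMEOUT_RE.search(entry["message"])
        # sent - timed out = how long the caller waited before giving up
        entry["implied_timeout_ms"] = (
            int(timeout.group(1)) - int(timeout.group(2)) if timeout else None
        )
        entries.append(entry)
    return entries


def merge_timeline(*logs):
    """Merge entries from several nodes' logs, ordered by timestamp."""
    merged = [entry for log in logs for entry in parse_entries(log)]
    return sorted(merged, key=lambda entry: entry["ts"])


# Entries copied from the excerpts above, message tails shortened.
SAMPLE_12 = """\
[2015-03-04 03:56:12,548][WARN ][transport ] [elasticsearch-ip-10-0-0-12] Received response for a request that has timed out, sent [20456ms] ago, timed out [5392ms] ago
[2015-03-04 04:36:08,788][ERROR][marvel.agent ] [elasticsearch-ip-10-0-0-12] Background thread had an uncaught exception:
"""
SAMPLE_45 = """\
[2015-03-04 04:43:27,245][WARN ][index.engine.internal ] [elasticsearch-ip-10-0-0-45] [myindex-20150304][1] failed engine [indices:data/write/bulk[s] failed on replica]
"""

if __name__ == "__main__":
    for entry in merge_timeline(SAMPLE_12, SAMPLE_45):
        print(
            entry["ts"].time(),
            entry["level"],
            entry["node"],
            entry["component"],
            entry["implied_timeout_ms"],
        )
```

Feeding it the full log files from all three nodes (rather than the shortened samples) should make it easier to see which node logged its first heap-related error, and how long the stats timeouts had been going on before that.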