Re: Yet another OOME: Java heap space thread :S

David Pilato Thu, 31 Jul 2014 00:46:20 -0700

Why do you start with an 8 GB heap? Can't you give it 16 GB or so?

/usr/bin/java -Xms8g -Xmx8g




--
David ;-)
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs


On 30 Jul 2014, at 19:47, Chris Neal <chris.n...@derbysoft.net> wrote:

Hi everyone,

First off, apologies for the thread.  I know OOME discussions are somewhat
overdone in the group, but I need to reach out for some help with this one.

I have a 2-node development cluster in EC2 on c3.4xlarge instances.  That means
16 vCPUs, 30GB RAM, and 1Gb network, and each instance has two 500GB EBS volumes
for Elasticsearch data.

I'm running Java 1.7.0_55, and using the G1 collector.  The Java args are:
/usr/bin/java -Xms8g -Xmx8g -Xss256k -Djava.awt.headless=true -server 
-XX:+UseCompressedOops -XX:+UseG1GC -XX:MaxGCPauseMillis=20 
-XX:+DisableExplicitGC -XX:+HeapDumpOnOutOfMemoryError

The index has 2 shards, each with 1 replica.

I have a daily index being filled with application log data.  The index, on 
average, gets to be about:
486M documents
53.1GB (primary size)
106.2GB (total size)
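
As a rough sanity check on the figures above, the following sketch derives the average document size, the primary data per shard, and the number of copies of the data. It assumes "GB" means 2^30 bytes and that the two primary shards are evenly sized; neither is stated in the post, and the variable names are purely illustrative.

```python
# Figures quoted in the post (daily index, on average).
docs = 486_000_000          # 486M documents
primary_gb = 53.1           # primary size
total_gb = 106.2            # total size (primaries + replicas)
primary_shards = 2          # "The index has 2 shards, each with 1 replica"

# Assumption: 1 GB = 2**30 bytes.
bytes_per_doc = primary_gb * 1024**3 / docs

# Assumption: documents are spread evenly across the primary shards.
gb_per_primary_shard = primary_gb / primary_shards

# Total / primary should equal 1 primary + number of replicas.
copies_of_data = total_gb / primary_gb

print(f"average document size: {bytes_per_doc:.0f} bytes")
print(f"primary data per shard: {gb_per_primary_shard:.2f} GB")
print(f"copies of the data:     {copies_of_data:.1f}")
```

The ratio of total to primary size comes out at 2, which is consistent with one replica per shard.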

Other than indexing, there really is nothing going on in the cluster.  No 
searches, or percolators, just collecting data.  

I have:
Tweaked the index.merge.policy settings
Tweaked indices.fielddata.breaker.limit and indices.fielddata.cache.size
Changed the index refresh_interval from 1s to 60s
Created a default template for the index such that _all is disabled and all
fields in the mapping are set to "not_analyzed"

Here is my complete elasticsearch.yml:

action:
  disable_delete_all_indices: true
cluster:
  name: elasticsearch-dev
discovery:
  zen:
    minimum_master_nodes: 2
    ping:
      multicast:
        enabled: false
      unicast:
        hosts: 10.0.0.45,10.0.0.41
gateway:
  recover_after_nodes: 2
index:
  merge:
    policy:
      max_merge_at_once: 5
      max_merged_segment: 15gb
  number_of_replicas: 1
  number_of_shards: 2
  refresh_interval: 60s
indices:
  fielddata:
    breaker:
      limit: 50%
    cache:
      size: 30%
node:
  name: elasticsearch-ip-10-0-0-45
path:
  data:
      - /usr/local/ebs01/elasticsearch
      - /usr/local/ebs02/elasticsearch
threadpool:
  bulk:
    queue_size: 500
    size: 75
    type: fixed
  get:
    queue_size: 200
    size: 100
    type: fixed
  index:
    queue_size: 1000
    size: 100
    type: fixed
  search:
    queue_size: 200
    size: 100
    type: fixed
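
The two fielddata settings in this file are percentages of the JVM heap, so their absolute size moves with -Xmx. A minimal sketch of that arithmetic, comparing the 8g heap from the quoted Java arguments with the 16g suggested in the reply; the helper name heap_share_gb is hypothetical and nothing here queries Elasticsearch:

```python
def heap_share_gb(heap_gb, percent):
    """Return the absolute size (in GB) of a percentage-of-heap setting."""
    return heap_gb * percent / 100.0

# Percentages taken from the elasticsearch.yml in the post.
BREAKER_LIMIT_PCT = 50   # indices.fielddata.breaker.limit
CACHE_SIZE_PCT = 30      # indices.fielddata.cache.size

for heap_gb in (8, 16):  # -Xmx8g as quoted, 16g as suggested in the reply
    breaker = heap_share_gb(heap_gb, BREAKER_LIMIT_PCT)
    cache = heap_share_gb(heap_gb, CACHE_SIZE_PCT)
    print(f"heap {heap_gb}g: breaker limit {breaker:.1f} GB, "
          f"fielddata cache {cache:.1f} GB")
```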

Heap usage sits at about 13GB.  I had been battling OOMEs for a while and
thought I had them licked, but one just popped up again.  My cluster had been
up and running fine for 14 days when I got this OOME:

=====
[2014-07-30 11:52:28,394][INFO ][monitor.jvm              ] 
[elasticsearch-ip-10-0-0-41] [gc][young][1158834][109906] duration [770ms], 
collections [1]/[1s], total [770ms]/[43.2m], memory [13.4gb]->[13.4gb]/[16gb], 
all_pools {[young] [648mb]->[8mb]/[0b]}{[survivor] [0b]->[0b]/[0b]}{[old] 
[12.8gb]->[13.4gb]/[16gb]}
[2014-07-30 15:03:01,070][WARN ][index.engine.internal    ] 
[elasticsearch-ip-10-0-0-41] [derbysoft-20140730][0] failed engine [out of 
memory]
[2014-07-30 15:03:10,324][WARN ][netty.channel.socket.nio.AbstractNioSelector] 
Unexpected exception in the selector loop.
java.lang.OutOfMemoryError: Java heap space
[2014-07-30 15:03:10,335][WARN ][netty.channel.socket.nio.AbstractNioSelector] 
Unexpected exception in the selector loop.
java.lang.OutOfMemoryError: Java heap space
[2014-07-30 15:03:10,324][WARN ][index.merge.scheduler    ] 
[elasticsearch-ip-10-0-0-41] [derbysoft-20140730][0] failed to merge
java.lang.OutOfMemoryError: Java heap space
[2014-07-30 15:03:28,595][WARN ][index.translog           ] 
[elasticsearch-ip-10-0-0-41] [derbysoft-20140730][0] failed to flush shard on 
translog threshold
org.elasticsearch.index.engine.FlushFailedEngineException: 
[derbysoft-20140730][0] Flush failed
        at 
org.elasticsearch.index.engine.internal.InternalEngine.flush(InternalEngine.java:805)
        at 
org.elasticsearch.index.shard.service.InternalIndexShard.flush(InternalIndexShard.java:604)
        at 
org.elasticsearch.index.translog.TranslogService$TranslogBasedFlush$1.run(TranslogService.java:202)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
Caused by: java.lang.IllegalStateException: this writer hit an 
OutOfMemoryError; cannot commit
        at 
org.apache.lucene.index.IndexWriter.startCommit(IndexWriter.java:4416)
        at 
org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:2989)
        at 
org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3096)
        at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3063)
        at 
org.elasticsearch.index.engine.internal.InternalEngine.flush(InternalEngine.java:797)
        ... 5 more
[2014-07-30 15:03:28,658][WARN ][cluster.action.shard     ] 
[elasticsearch-ip-10-0-0-41] [derbysoft-20140730][0] sending failed shard for 
[derbysoft-20140730][0], node[W-7FsjjZTyOXZdaJhhqxEA], [R], s[STARTED], 
indexUUID [QC5Sg0FDSnOGUiFg30qNxA], reason [engine failure, message [out of 
memory][IllegalStateException[this writer hit an OutOfMemoryError; cannot 
commit]]]
[2014-07-30 15:34:36,418][WARN ][netty.channel.socket.nio.AbstractNioSelector] 
Unexpected exception in the selector loop.
java.lang.OutOfMemoryError: Java heap space
[2014-07-30 15:34:39,847][WARN ][netty.channel.socket.nio.AbstractNioSelector] 
Unexpected exception in the selector loop.
java.lang.OutOfMemoryError: Java heap space
[2014-07-30 15:34:42,873][WARN ][index.merge.scheduler    ] 
[elasticsearch-ip-10-0-0-41] [derbysoft-20140730][1] failed to merge
java.lang.OutOfMemoryError: Java heap space
[2014-07-30 15:34:42,873][WARN ][index.engine.internal    ] 
[elasticsearch-ip-10-0-0-41] [derbysoft-20140730][1] failed engine [merge 
exception]
[2014-07-30 15:34:43,185][WARN ][cluster.action.shard     ] 
[elasticsearch-ip-10-0-0-41] [derbysoft-20140730][1] sending failed shard for 
[derbysoft-20140730][1], node[W-7FsjjZTyOXZdaJhhqxEA], [P], s[STARTED], 
indexUUID [QC5Sg0FDSnOGUiFg30qNxA], reason [engine failure, message [merge 
exception][MergeException[java.lang.OutOfMemoryError: Java heap space]; nested: 
OutOfMemoryError[Java heap space]; ]]
[2014-07-30 15:57:42,531][WARN ][indices.recovery         ] 
[elasticsearch-ip-10-0-0-41] [derbysoft-20140730][1] recovery from 
[[elasticsearch-ip-10-0-0-45][AjN-6_DHQK6B8NJgfphMvA][ip-10-0-0-45.us-west-2.compute.internal][inet[/10.0.0.45:9300]]]
 failed
org.elasticsearch.transport.RemoteTransportException: 
[elasticsearch-ip-10-0-0-45][inet[/10.0.0.45:9300]][index/shard/recovery/startRecovery]
Caused by: org.elasticsearch.index.engine.RecoveryEngineException: 
[derbysoft-20140730][1] Phase[2] Execution failed
        at 
org.elasticsearch.index.engine.internal.InternalEngine.recover(InternalEngine.java:1011)
        at 
org.elasticsearch.index.shard.service.InternalIndexShard.recover(InternalIndexShard.java:631)
        at 
org.elasticsearch.indices.recovery.RecoverySource.recover(RecoverySource.java:122)
        at 
org.elasticsearch.indices.recovery.RecoverySource.access$1600(RecoverySource.java:62)
        at 
org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:351)
        at 
org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:337)
        at 
org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:270)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
Caused by: org.elasticsearch.transport.ReceiveTimeoutTransportException: 
[elasticsearch-ip-10-0-0-41][inet[/10.0.0.41:9300]][index/shard/recovery/prepareTranslog]
 request_id [13988539] timed out after [900000ms]
        at 
org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:369)
        ... 3 more
[2014-07-30 15:57:42,534][WARN ][indices.cluster          ] 
[elasticsearch-ip-10-0-0-41] [derbysoft-20140730][1] failed to start shard
org.elasticsearch.indices.recovery.RecoveryFailedException: 
[derbysoft-20140730][1]: Recovery failed from 
[elasticsearch-ip-10-0-0-45][AjN-6_DHQK6B8NJgfphMvA][ip-10-0-0-45.us-west-2.compute.internal][inet[/10.0.0.45:9300]]
 into 
[elasticsearch-ip-10-0-0-41][W-7FsjjZTyOXZdaJhhqxEA][ip-10-0-0-41.us-west-2.compute.internal][inet[ip-10-0-0-41.us-west-2.compute.internal/10.0.0.41:9300]]
        at 
org.elasticsearch.indices.recovery.RecoveryTarget.doRecovery(RecoveryTarget.java:306)
        at 
org.elasticsearch.indices.recovery.RecoveryTarget.access$300(RecoveryTarget.java:65)
        at 
org.elasticsearch.indices.recovery.RecoveryTarget$3.run(RecoveryTarget.java:184)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
Caused by: org.elasticsearch.transport.RemoteTransportException: 
[elasticsearch-ip-10-0-0-45][inet[/10.0.0.45:9300]][index/shard/recovery/startRecovery]
Caused by: org.elasticsearch.index.engine.RecoveryEngineException: 
[derbysoft-20140730][1] Phase[2] Execution failed
        at 
org.elasticsearch.index.engine.internal.InternalEngine.recover(InternalEngine.java:1011)
        at 
org.elasticsearch.index.shard.service.InternalIndexShard.recover(InternalIndexShard.java:631)
        at 
org.elasticsearch.indices.recovery.RecoverySource.recover(RecoverySource.java:122)
        at 
org.elasticsearch.indices.recovery.RecoverySource.access$1600(RecoverySource.java:62)
        at 
org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:351)
        at 
org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:337)
        at 
org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:270)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
Caused by: org.elasticsearch.transport.ReceiveTimeoutTransportException: 
[elasticsearch-ip-10-0-0-41][inet[/10.0.0.41:9300]][index/shard/recovery/prepareTranslog]
 request_id [13988539] timed out after [900000ms]
        at 
org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:369)
        ... 3 more
[2014-07-30 15:57:42,535][WARN ][cluster.action.shard     ] 
[elasticsearch-ip-10-0-0-41] [derbysoft-20140730][1] sending failed shard for 
[derbysoft-20140730][1], node[W-7FsjjZTyOXZdaJhhqxEA], [R], s[INITIALIZING], 
indexUUID [QC5Sg0FDSnOGUiFg30qNxA], reason [Failed to start shard, message 
[RecoveryFailedException[[derbysoft-20140730][1]: Recovery failed from 
[elasticsearch-ip-10-0-0-45][AjN-6_DHQK6B8NJgfphMvA][ip-10-0-0-45.us-west-2.compute.internal][inet[/10.0.0.45:9300]]
 into 
[elasticsearch-ip-10-0-0-41][W-7FsjjZTyOXZdaJhhqxEA][ip-10-0-0-41.us-west-2.compute.internal][inet[ip-10-0-0-41.us-west-2.compute.internal/10.0.0.41:9300]]];
 nested: 
RemoteTransportException[[elasticsearch-ip-10-0-0-45][inet[/10.0.0.45:9300]][index/shard/recovery/startRecovery]];
 nested: RecoveryEngineException[[derbysoft-20140730][1] Phase[2] Execution 
failed]; nested: 
ReceiveTimeoutTransportException[[elasticsearch-ip-10-0-0-41][inet[/10.0.0.41:9300]][index/shard/recovery/prepareTranslog]
 request_id [13988539] timed out after [900000ms]]; ]]

=====
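
To track heap pressure over time, the memory figures can be pulled out of the monitor.jvm entries. The sketch below is one possible way to do that with a regular expression; it assumes the wrapped log entry has first been joined back into a single line, and the function and field names are hypothetical. Note that the entry reports a maximum of 16gb, whereas the Java arguments quoted earlier say -Xmx8g; the sketch simply reports what the log states.

```python
import re

# Matches the "memory [before]->[after]/[max]" part of a monitor.jvm entry.
MEMORY_RE = re.compile(
    r"memory \[(?P<before>[\d.]+)(?P<before_unit>[kmg]b)\]->"
    r"\[(?P<after>[\d.]+)(?P<after_unit>[kmg]b)\]/"
    r"\[(?P<max>[\d.]+)(?P<max_unit>[kmg]b)\]"
)

# Conversion factors to GB (binary units assumed).
UNIT_TO_GB = {"kb": 1 / 1024**2, "mb": 1 / 1024, "gb": 1.0}


def parse_gc_memory(line):
    """Return heap usage before/after GC and the heap maximum, in GB."""
    match = MEMORY_RE.search(line)
    if match is None:
        return None
    return {
        key + "_gb": float(match.group(key)) * UNIT_TO_GB[match.group(key + "_unit")]
        for key in ("before", "after", "max")
    }


# The young-generation GC entry from the log above, joined into one line.
sample = (
    "[2014-07-30 11:52:28,394][INFO ][monitor.jvm              ] "
    "[elasticsearch-ip-10-0-0-41] [gc][young][1158834][109906] duration [770ms], "
    "collections [1]/[1s], total [770ms]/[43.2m], "
    "memory [13.4gb]->[13.4gb]/[16gb], all_pools {[young] [648mb]->[8mb]/[0b]}"
)

parsed = parse_gc_memory(sample)
used_fraction = parsed["after_gb"] / parsed["max_gb"]
print(f"heap after GC: {parsed['after_gb']} GB of {parsed['max_gb']} GB "
      f"({used_fraction:.0%} used)")
```

In this entry the heap is just as full after the collection as before it, which suggests that the young collection freed essentially nothing overall.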

I'm a bit at a loss as to what to try next to address this problem.  Can anyone 
offer a suggestion?
Thanks for reading this.

Chris

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/CAND3DpjXh8Bcrfuy6GLvFOoDH7yUMLr1%3DA2Sb0SFQx6PgkQEvA%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

