Oops, sorry. That was a copy/paste error. It is using 16GB. Here are the correct process arguments:
/usr/bin/java -Xms16g -Xmx16g -Xss256k -Djava.awt.headless=true -server -XX:+UseCompressedOops -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:+DisableExplicitGC -XX:+HeapDumpOnOutOfMemoryError -Delasticsearch -Des.pidfile=/var/run/elasticsearch/elasticsearch.pid -Des.path.home=/usr/share/elasticsearch [snip CP]

Thanks!
Chris

On Thu, Jul 31, 2014 at 2:43 AM, David Pilato <da...@pilato.fr> wrote:

> Why do you start with 8gb HEAP? Can't you give it 16gb or so?
>
> /usr/bin/java -Xms8g -Xmx8g
>
> --
> David ;-)
> Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs
>
> On 30 Jul 2014, at 19:47, Chris Neal <chris.n...@derbysoft.net> wrote:
>
> Hi everyone,
>
> First off, apologies for the thread. I know OOME discussions are somewhat overdone in the group, but I need to reach out for some help on this one.
>
> I have a 2-node development cluster in EC2 on c3.4xlarge instances. That means 16 vCPUs, 30GB RAM, a 1Gb network, and I have two 500GB EBS volumes for Elasticsearch data on each instance.
>
> I'm running Java 1.7.0_55 and using the G1 collector. The Java args are:
>
> /usr/bin/java -Xms8g -Xmx8g -Xss256k -Djava.awt.headless=true -server -XX:+UseCompressedOops -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:+DisableExplicitGC -XX:+HeapDumpOnOutOfMemoryError
>
> The index has 2 shards, each with 1 replica.
>
> I have a daily index being filled with application log data. The index, on average, grows to about:
> 486M documents
> 53.1GB (primary size)
> 106.2GB (total size)
>
> Other than indexing, there really is nothing going on in the cluster. No searches or percolators, just collecting data.
>
> I have:
>
> - Tweaked the index.merge.policy
> - Tweaked the indices.fielddata.breaker.limit and cache.size
> - Changed the index refresh_interval from 1s to 60s
> - Created a default template for the index such that _all is disabled, and all fields in the mapping are set to "not_analyzed".
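[Editor's note: for readers unfamiliar with that last step, a template along those lines might look like the following. This is a sketch only — the template name and index pattern are made up, not taken from the thread — using the Elasticsearch 1.x `_template` API with a dynamic template that maps every string field as not_analyzed.]

```shell
# Hypothetical template name ("logs_template") and index pattern ("derbysoft-*").
# Disables _all and maps all dynamically-added string fields as not_analyzed.
curl -XPUT 'http://localhost:9200/_template/logs_template' -d '{
  "template": "derbysoft-*",
  "mappings": {
    "_default_": {
      "_all": { "enabled": false },
      "dynamic_templates": [
        {
          "strings_not_analyzed": {
            "match": "*",
            "match_mapping_type": "string",
            "mapping": { "type": "string", "index": "not_analyzed" }
          }
        }
      ]
    }
  }
}'
```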
>
> Here is my complete elasticsearch.yml:
>
> action:
>   disable_delete_all_indices: true
> cluster:
>   name: elasticsearch-dev
> discovery:
>   zen:
>     minimum_master_nodes: 2
>     ping:
>       multicast:
>         enabled: false
>       unicast:
>         hosts: 10.0.0.45,10.0.0.41
> gateway:
>   recover_after_nodes: 2
> index:
>   merge:
>     policy:
>       max_merge_at_once: 5
>       max_merged_segment: 15gb
>   number_of_replicas: 1
>   number_of_shards: 2
>   refresh_interval: 60s
> indices:
>   fielddata:
>     breaker:
>       limit: 50%
>     cache:
>       size: 30%
> node:
>   name: elasticsearch-ip-10-0-0-45
> path:
>   data:
>     - /usr/local/ebs01/elasticsearch
>     - /usr/local/ebs02/elasticsearch
> threadpool:
>   bulk:
>     queue_size: 500
>     size: 75
>     type: fixed
>   get:
>     queue_size: 200
>     size: 100
>     type: fixed
>   index:
>     queue_size: 1000
>     size: 100
>     type: fixed
>   search:
>     queue_size: 200
>     size: 100
>     type: fixed
>
> The heap sits at about 13GB used. I had been battling OOME exceptions for a while, and thought I had it licked, but one just popped up again. My cluster had been up and running fine for 14 days, and then I got this OOME:
>
> =====
> [2014-07-30 11:52:28,394][INFO ][monitor.jvm              ] [elasticsearch-ip-10-0-0-41] [gc][young][1158834][109906] duration [770ms], collections [1]/[1s], total [770ms]/[43.2m], memory [13.4gb]->[13.4gb]/[16gb], all_pools {[young] [648mb]->[8mb]/[0b]}{[survivor] [0b]->[0b]/[0b]}{[old] [12.8gb]->[13.4gb]/[16gb]}
> [2014-07-30 15:03:01,070][WARN ][index.engine.internal    ] [elasticsearch-ip-10-0-0-41] [derbysoft-20140730][0] failed engine [out of memory]
> [2014-07-30 15:03:10,324][WARN ][netty.channel.socket.nio.AbstractNioSelector] Unexpected exception in the selector loop.
> java.lang.OutOfMemoryError: Java heap space
> [2014-07-30 15:03:10,335][WARN ][netty.channel.socket.nio.AbstractNioSelector] Unexpected exception in the selector loop.
> java.lang.OutOfMemoryError: Java heap space
> [2014-07-30 15:03:10,324][WARN ][index.merge.scheduler    ] [elasticsearch-ip-10-0-0-41] [derbysoft-20140730][0] failed to merge
> java.lang.OutOfMemoryError: Java heap space
> [2014-07-30 15:03:28,595][WARN ][index.translog           ] [elasticsearch-ip-10-0-0-41] [derbysoft-20140730][0] failed to flush shard on translog threshold
> org.elasticsearch.index.engine.FlushFailedEngineException: [derbysoft-20140730][0] Flush failed
>     at org.elasticsearch.index.engine.internal.InternalEngine.flush(InternalEngine.java:805)
>     at org.elasticsearch.index.shard.service.InternalIndexShard.flush(InternalIndexShard.java:604)
>     at org.elasticsearch.index.translog.TranslogService$TranslogBasedFlush$1.run(TranslogService.java:202)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:744)
> Caused by: java.lang.IllegalStateException: this writer hit an OutOfMemoryError; cannot commit
>     at org.apache.lucene.index.IndexWriter.startCommit(IndexWriter.java:4416)
>     at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:2989)
>     at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3096)
>     at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3063)
>     at org.elasticsearch.index.engine.internal.InternalEngine.flush(InternalEngine.java:797)
>     ... 5 more
> [2014-07-30 15:03:28,658][WARN ][cluster.action.shard     ] [elasticsearch-ip-10-0-0-41] [derbysoft-20140730][0] sending failed shard for [derbysoft-20140730][0], node[W-7FsjjZTyOXZdaJhhqxEA], [R], s[STARTED], indexUUID [QC5Sg0FDSnOGUiFg30qNxA], reason [engine failure, message [out of memory][IllegalStateException[this writer hit an OutOfMemoryError; cannot commit]]]
> [2014-07-30 15:34:36,418][WARN ][netty.channel.socket.nio.AbstractNioSelector] Unexpected exception in the selector loop.
> java.lang.OutOfMemoryError: Java heap space
> [2014-07-30 15:34:39,847][WARN ][netty.channel.socket.nio.AbstractNioSelector] Unexpected exception in the selector loop.
> java.lang.OutOfMemoryError: Java heap space
> [2014-07-30 15:34:42,873][WARN ][index.merge.scheduler    ] [elasticsearch-ip-10-0-0-41] [derbysoft-20140730][1] failed to merge
> java.lang.OutOfMemoryError: Java heap space
> [2014-07-30 15:34:42,873][WARN ][index.engine.internal    ] [elasticsearch-ip-10-0-0-41] [derbysoft-20140730][1] failed engine [merge exception]
> [2014-07-30 15:34:43,185][WARN ][cluster.action.shard     ] [elasticsearch-ip-10-0-0-41] [derbysoft-20140730][1] sending failed shard for [derbysoft-20140730][1], node[W-7FsjjZTyOXZdaJhhqxEA], [P], s[STARTED], indexUUID [QC5Sg0FDSnOGUiFg30qNxA], reason [engine failure, message [merge exception][MergeException[java.lang.OutOfMemoryError: Java heap space]; nested: OutOfMemoryError[Java heap space]; ]]
> [2014-07-30 15:57:42,531][WARN ][indices.recovery         ] [elasticsearch-ip-10-0-0-41] [derbysoft-20140730][1] recovery from [[elasticsearch-ip-10-0-0-45][AjN-6_DHQK6B8NJgfphMvA][ip-10-0-0-45.us-west-2.compute.internal][inet[/10.0.0.45:9300]]] failed
> org.elasticsearch.transport.RemoteTransportException: [elasticsearch-ip-10-0-0-45][inet[/10.0.0.45:9300]][index/shard/recovery/startRecovery]
> Caused by: org.elasticsearch.index.engine.RecoveryEngineException: [derbysoft-20140730][1] Phase[2] Execution failed
>     at org.elasticsearch.index.engine.internal.InternalEngine.recover(InternalEngine.java:1011)
>     at org.elasticsearch.index.shard.service.InternalIndexShard.recover(InternalIndexShard.java:631)
>     at org.elasticsearch.indices.recovery.RecoverySource.recover(RecoverySource.java:122)
>     at org.elasticsearch.indices.recovery.RecoverySource.access$1600(RecoverySource.java:62)
>     at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:351)
>     at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:337)
>     at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:270)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:744)
> Caused by: org.elasticsearch.transport.ReceiveTimeoutTransportException: [elasticsearch-ip-10-0-0-41][inet[/10.0.0.41:9300]][index/shard/recovery/prepareTranslog] request_id [13988539] timed out after [900000ms]
>     at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:369)
>     ... 3 more
> [2014-07-30 15:57:42,534][WARN ][indices.cluster          ] [elasticsearch-ip-10-0-0-41] [derbysoft-20140730][1] failed to start shard
> org.elasticsearch.indices.recovery.RecoveryFailedException: [derbysoft-20140730][1]: Recovery failed from [elasticsearch-ip-10-0-0-45][AjN-6_DHQK6B8NJgfphMvA][ip-10-0-0-45.us-west-2.compute.internal][inet[/10.0.0.45:9300]] into [elasticsearch-ip-10-0-0-41][W-7FsjjZTyOXZdaJhhqxEA][ip-10-0-0-41.us-west-2.compute.internal][inet[ip-10-0-0-41.us-west-2.compute.internal/10.0.0.41:9300]]
>     at org.elasticsearch.indices.recovery.RecoveryTarget.doRecovery(RecoveryTarget.java:306)
>     at org.elasticsearch.indices.recovery.RecoveryTarget.access$300(RecoveryTarget.java:65)
>     at org.elasticsearch.indices.recovery.RecoveryTarget$3.run(RecoveryTarget.java:184)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:744)
> Caused by: org.elasticsearch.transport.RemoteTransportException: [elasticsearch-ip-10-0-0-45][inet[/10.0.0.45:9300]][index/shard/recovery/startRecovery]
> Caused by: org.elasticsearch.index.engine.RecoveryEngineException: [derbysoft-20140730][1] Phase[2] Execution failed
>     at org.elasticsearch.index.engine.internal.InternalEngine.recover(InternalEngine.java:1011)
>     at org.elasticsearch.index.shard.service.InternalIndexShard.recover(InternalIndexShard.java:631)
>     at org.elasticsearch.indices.recovery.RecoverySource.recover(RecoverySource.java:122)
>     at org.elasticsearch.indices.recovery.RecoverySource.access$1600(RecoverySource.java:62)
>     at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:351)
>     at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:337)
>     at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:270)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:744)
> Caused by: org.elasticsearch.transport.ReceiveTimeoutTransportException: [elasticsearch-ip-10-0-0-41][inet[/10.0.0.41:9300]][index/shard/recovery/prepareTranslog] request_id [13988539] timed out after [900000ms]
>     at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:369)
>     ... 3 more
> [2014-07-30 15:57:42,535][WARN ][cluster.action.shard     ] [elasticsearch-ip-10-0-0-41] [derbysoft-20140730][1] sending failed shard for [derbysoft-20140730][1], node[W-7FsjjZTyOXZdaJhhqxEA], [R], s[INITIALIZING], indexUUID [QC5Sg0FDSnOGUiFg30qNxA], reason [Failed to start shard, message [RecoveryFailedException[[derbysoft-20140730][1]: Recovery failed from [elasticsearch-ip-10-0-0-45][AjN-6_DHQK6B8NJgfphMvA][ip-10-0-0-45.us-west-2.compute.internal][inet[/10.0.0.45:9300]] into [elasticsearch-ip-10-0-0-41][W-7FsjjZTyOXZdaJhhqxEA][ip-10-0-0-41.us-west-2.compute.internal][inet[ip-10-0-0-41.us-west-2.compute.internal/10.0.0.41:9300]]]; nested: RemoteTransportException[[elasticsearch-ip-10-0-0-45][inet[/10.0.0.45:9300]][index/shard/recovery/startRecovery]]; nested: RecoveryEngineException[[derbysoft-20140730][1] Phase[2] Execution failed]; nested: ReceiveTimeoutTransportException[[elasticsearch-ip-10-0-0-41][inet[/10.0.0.41:9300]][index/shard/recovery/prepareTranslog] request_id [13988539] timed out after [900000ms]]; ]]
> =====
>
> I'm a bit at a loss as to what to try next to address this problem. Can anyone offer a suggestion?
>
> Thanks for reading this.
>
> Chris
>
> --
> You received this message because you are subscribed to the Google Groups "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscr...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAND3DpjXh8Bcrfuy6GLvFOoDH7yUMLr1%3DA2Sb0SFQx6PgkQEvA%40mail.gmail.com.
> For more options, visit https://groups.google.com/d/optout.