Thank you so very much for the reply! That makes sense. I will look at the gist as well, and make some changes to test.
Again, thank you for your time. I will report back with some results!
Chris

On Wed, Sep 17, 2014 at 11:36 AM, joergpra...@gmail.com <joergpra...@gmail.com> wrote:

I have a very similar cluster setup here (ES 1.3.2, 64G RAM, 3 nodes, Java 8, G1GC, ~100 shards, ~500g of indexes on disk).

This is the culprit:

max_merged_segment: 15gb

I recommend:

max_merged_segment: 1gb

See also https://gist.github.com/jprante/10666960 (which also holds for ES 1.2 and ES 1.3 - these versions have better OOTB defaults for merge).

With this I can use an 8g heap for my workload.

Rule of thumb: at any time, your heap must be able to cope with an extra allocation of max_merged_segment (this is NOT what happens behind the scenes, it is just a rough estimate).

With 15g in your setting, the risk of overallocating the heap is high when your index gets large.

Jörg
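By that rule of thumb, a 15gb max_merged_segment against a 16gb heap leaves almost no headroom, while 1gb leaves roughly 15gb to spare. On ES 1.x the tiered merge-policy settings can generally be updated per index at runtime; a minimal sketch, assuming a node listening on localhost:9200 (if your version rejects the dynamic update, set the value in elasticsearch.yml and restart instead):

curl -XPUT 'http://localhost:9200/derbysoft-20140730/_settings' -d '{
  "index.merge.policy.max_merged_segment": "1gb"
}'

Note that this only affects merges from here on; segments that are already oversized stay as they are.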
On Wed, Sep 17, 2014 at 5:43 PM, Chris Neal <chris.n...@derbysoft.net> wrote:

Sorry to bump my own thread, but it's been a while and I was hoping to get some more eyes on this. I've since added a third node to the cluster to see if that helps, but it did not. I still see these OOMEs on merges on any of the three nodes in the cluster.

I have also increased the shard count to 3 to match the number of nodes in the cluster.

The error happens on an index that is 44GB in size.

The process in top looks like this:
================
top - 15:39:59 up 63 days, 18:34,  2 users,  load average: 0.77, 0.66, 0.71
Tasks: 343 total,   1 running, 342 sleeping,   0 stopped,   0 zombie
Cpu(s):  1.1%us,  0.1%sy,  0.0%ni, 98.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  30688804k total, 27996932k used,  2691872k free,    62760k buffers
Swap: 10485752k total,     5832k used, 10479920k free,  9434584k cached

  PID USER     PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
29539 elastics 20   0  207g  17g 1.1g S 20.2 60.9 2874:16  java
================

The process using the 5MB of swap is not elasticsearch, just FYI.

If there is any more information I can provide, please let me know. I'm getting a bit desperate to get this one resolved!
Thank you so much for your time.
Chris
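With the shard count now at 3, a 44GB index works out to roughly 14-15GB per shard if that figure is primary-only — right around one 15gb max_merged_segment, which fits Jörg's rule of thumb above. Per-shard sizes can be confirmed with the cat API (available since ES 1.0); a sketch, assuming a node on localhost:9200 and the daily index naming pattern from this thread:

curl 'http://localhost:9200/_cat/shards/derbysoft-*?v'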
On Thu, Jul 31, 2014 at 10:06 AM, Chris Neal <chris.n...@derbysoft.net> wrote:

Oops. Sorry. That was a copy/paste error. It is using 16GB. Here are the correct process arguments:

/usr/bin/java -Xms16g -Xmx16g -Xss256k -Djava.awt.headless=true -server -XX:+UseCompressedOops -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:+DisableExplicitGC -XX:+HeapDumpOnOutOfMemoryError -Delasticsearch -Des.pidfile=/var/run/elasticsearch/elasticsearch.pid -Des.path.home=/usr/share/elasticsearch [snip CP]

Thanks!
Chris

On Thu, Jul 31, 2014 at 2:43 AM, David Pilato <da...@pilato.fr> wrote:

Why do you start with an 8gb heap? Can't you give it 16gb or so?

/usr/bin/java -Xms8g -Xmx8g

--
David ;-)
Twitter: @dadoonet / @elasticsearchfr / @scrutmydocs

On Jul 30, 2014, at 19:47, Chris Neal <chris.n...@derbysoft.net> wrote:

Hi everyone,

First off, apologies for the thread. I know OOME discussions are somewhat overdone in this group, but I need to reach out for some help on this one.

I have a 2-node development cluster in EC2 on c3.4xlarge AMIs. That means 16 vCPUs, 30GB RAM, a 1Gb network, and two 500GB EBS volumes for Elasticsearch data on each AMI.

I'm running Java 1.7.0_55 and using the G1 collector. The Java args are:

/usr/bin/java -Xms8g -Xmx8g -Xss256k -Djava.awt.headless=true -server -XX:+UseCompressedOops -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:+DisableExplicitGC -XX:+HeapDumpOnOutOfMemoryError

The index has 2 shards, each with 1 replica.

I have a daily index being filled with application log data. The index, on average, gets to be about:
486M documents
53.1GB (primary size)
106.2GB (total size)

Other than indexing, there really is nothing going on in the cluster. No searches or percolators, just collecting data.

I have:

- Tweaked the index.merge.policy
- Tweaked the indices.fielddata.breaker.limit and cache.size
- Changed the index refresh_interval from 1s to 60s
- Created a default template for the index such that _all is disabled and all fields in the mapping are set to "not_analyzed" (a sketch of such a template follows the config below)

Here is my complete elasticsearch.yml:

action:
  disable_delete_all_indices: true
cluster:
  name: elasticsearch-dev
discovery:
  zen:
    minimum_master_nodes: 2
    ping:
      multicast:
        enabled: false
      unicast:
        hosts: 10.0.0.45,10.0.0.41
gateway:
  recover_after_nodes: 2
index:
  merge:
    policy:
      max_merge_at_once: 5
      max_merged_segment: 15gb
  number_of_replicas: 1
  number_of_shards: 2
  refresh_interval: 60s
indices:
  fielddata:
    breaker:
      limit: 50%
    cache:
      size: 30%
node:
  name: elasticsearch-ip-10-0-0-45
path:
  data:
    - /usr/local/ebs01/elasticsearch
    - /usr/local/ebs02/elasticsearch
threadpool:
  bulk:
    queue_size: 500
    size: 75
    type: fixed
  get:
    queue_size: 200
    size: 100
    type: fixed
  index:
    queue_size: 1000
    size: 100
    type: fixed
  search:
    queue_size: 200
    size: 100
    type: fixed
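For reference, the default template described in the bullet list above might look roughly like the following — a sketch only, assuming ES 1.x template and mapping syntax; the template name "logs_default" and the catch-all string rule are illustrative, not necessarily the actual template in use:

curl -XPUT 'http://localhost:9200/_template/logs_default' -d '{
  "template": "derbysoft-*",
  "mappings": {
    "_default_": {
      "_all": { "enabled": false },
      "dynamic_templates": [
        {
          "strings_not_analyzed": {
            "match": "*",
            "match_mapping_type": "string",
            "mapping": { "type": "string", "index": "not_analyzed" }
          }
        }
      ]
    }
  }
}'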
The heap sits at about 13GB used. I had been battling OOME exceptions for a while and thought I had it licked, but one just popped up again. My cluster has been up and running fine for 14 days, and I just got this OOME:

=====
[2014-07-30 11:52:28,394][INFO ][monitor.jvm              ] [elasticsearch-ip-10-0-0-41] [gc][young][1158834][109906] duration [770ms], collections [1]/[1s], total [770ms]/[43.2m], memory [13.4gb]->[13.4gb]/[16gb], all_pools {[young] [648mb]->[8mb]/[0b]}{[survivor] [0b]->[0b]/[0b]}{[old] [12.8gb]->[13.4gb]/[16gb]}
[2014-07-30 15:03:01,070][WARN ][index.engine.internal    ] [elasticsearch-ip-10-0-0-41] [derbysoft-20140730][0] failed engine [out of memory]
[2014-07-30 15:03:10,324][WARN ][netty.channel.socket.nio.AbstractNioSelector] Unexpected exception in the selector loop.
java.lang.OutOfMemoryError: Java heap space
[2014-07-30 15:03:10,335][WARN ][netty.channel.socket.nio.AbstractNioSelector] Unexpected exception in the selector loop.
java.lang.OutOfMemoryError: Java heap space
[2014-07-30 15:03:10,324][WARN ][index.merge.scheduler    ] [elasticsearch-ip-10-0-0-41] [derbysoft-20140730][0] failed to merge
java.lang.OutOfMemoryError: Java heap space
[2014-07-30 15:03:28,595][WARN ][index.translog           ] [elasticsearch-ip-10-0-0-41] [derbysoft-20140730][0] failed to flush shard on translog threshold
org.elasticsearch.index.engine.FlushFailedEngineException: [derbysoft-20140730][0] Flush failed
    at org.elasticsearch.index.engine.internal.InternalEngine.flush(InternalEngine.java:805)
    at org.elasticsearch.index.shard.service.InternalIndexShard.flush(InternalIndexShard.java:604)
    at org.elasticsearch.index.translog.TranslogService$TranslogBasedFlush$1.run(TranslogService.java:202)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
Caused by: java.lang.IllegalStateException: this writer hit an OutOfMemoryError; cannot commit
    at org.apache.lucene.index.IndexWriter.startCommit(IndexWriter.java:4416)
    at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:2989)
    at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3096)
    at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3063)
    at org.elasticsearch.index.engine.internal.InternalEngine.flush(InternalEngine.java:797)
    ... 5 more
[2014-07-30 15:03:28,658][WARN ][cluster.action.shard     ] [elasticsearch-ip-10-0-0-41] [derbysoft-20140730][0] sending failed shard for [derbysoft-20140730][0], node[W-7FsjjZTyOXZdaJhhqxEA], [R], s[STARTED], indexUUID [QC5Sg0FDSnOGUiFg30qNxA], reason [engine failure, message [out of memory][IllegalStateException[this writer hit an OutOfMemoryError; cannot commit]]]
[2014-07-30 15:34:36,418][WARN ][netty.channel.socket.nio.AbstractNioSelector] Unexpected exception in the selector loop.
java.lang.OutOfMemoryError: Java heap space
[2014-07-30 15:34:39,847][WARN ][netty.channel.socket.nio.AbstractNioSelector] Unexpected exception in the selector loop.
java.lang.OutOfMemoryError: Java heap space
[2014-07-30 15:34:42,873][WARN ][index.merge.scheduler    ] [elasticsearch-ip-10-0-0-41] [derbysoft-20140730][1] failed to merge
java.lang.OutOfMemoryError: Java heap space
[2014-07-30 15:34:42,873][WARN ][index.engine.internal    ] [elasticsearch-ip-10-0-0-41] [derbysoft-20140730][1] failed engine [merge exception]
[2014-07-30 15:34:43,185][WARN ][cluster.action.shard     ] [elasticsearch-ip-10-0-0-41] [derbysoft-20140730][1] sending failed shard for [derbysoft-20140730][1], node[W-7FsjjZTyOXZdaJhhqxEA], [P], s[STARTED], indexUUID [QC5Sg0FDSnOGUiFg30qNxA], reason [engine failure, message [merge exception][MergeException[java.lang.OutOfMemoryError: Java heap space]; nested: OutOfMemoryError[Java heap space]; ]]
[2014-07-30 15:57:42,531][WARN ][indices.recovery         ] [elasticsearch-ip-10-0-0-41] [derbysoft-20140730][1] recovery from [[elasticsearch-ip-10-0-0-45][AjN-6_DHQK6B8NJgfphMvA][ip-10-0-0-45.us-west-2.compute.internal][inet[/10.0.0.45:9300]]] failed
org.elasticsearch.transport.RemoteTransportException: [elasticsearch-ip-10-0-0-45][inet[/10.0.0.45:9300]][index/shard/recovery/startRecovery]
Caused by: org.elasticsearch.index.engine.RecoveryEngineException: [derbysoft-20140730][1] Phase[2] Execution failed
    at org.elasticsearch.index.engine.internal.InternalEngine.recover(InternalEngine.java:1011)
    at org.elasticsearch.index.shard.service.InternalIndexShard.recover(InternalIndexShard.java:631)
    at org.elasticsearch.indices.recovery.RecoverySource.recover(RecoverySource.java:122)
    at org.elasticsearch.indices.recovery.RecoverySource.access$1600(RecoverySource.java:62)
    at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:351)
    at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:337)
    at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:270)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
Caused by: org.elasticsearch.transport.ReceiveTimeoutTransportException: [elasticsearch-ip-10-0-0-41][inet[/10.0.0.41:9300]][index/shard/recovery/prepareTranslog] request_id [13988539] timed out after [900000ms]
    at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:369)
    ... 3 more
[2014-07-30 15:57:42,534][WARN ][indices.cluster          ] [elasticsearch-ip-10-0-0-41] [derbysoft-20140730][1] failed to start shard
org.elasticsearch.indices.recovery.RecoveryFailedException: [derbysoft-20140730][1]: Recovery failed from [elasticsearch-ip-10-0-0-45][AjN-6_DHQK6B8NJgfphMvA][ip-10-0-0-45.us-west-2.compute.internal][inet[/10.0.0.45:9300]] into [elasticsearch-ip-10-0-0-41][W-7FsjjZTyOXZdaJhhqxEA][ip-10-0-0-41.us-west-2.compute.internal][inet[ip-10-0-0-41.us-west-2.compute.internal/10.0.0.41:9300]]
    at org.elasticsearch.indices.recovery.RecoveryTarget.doRecovery(RecoveryTarget.java:306)
    at org.elasticsearch.indices.recovery.RecoveryTarget.access$300(RecoveryTarget.java:65)
    at org.elasticsearch.indices.recovery.RecoveryTarget$3.run(RecoveryTarget.java:184)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
Caused by: org.elasticsearch.transport.RemoteTransportException: [elasticsearch-ip-10-0-0-45][inet[/10.0.0.45:9300]][index/shard/recovery/startRecovery]
Caused by: org.elasticsearch.index.engine.RecoveryEngineException: [derbysoft-20140730][1] Phase[2] Execution failed
    at org.elasticsearch.index.engine.internal.InternalEngine.recover(InternalEngine.java:1011)
    at org.elasticsearch.index.shard.service.InternalIndexShard.recover(InternalIndexShard.java:631)
    at org.elasticsearch.indices.recovery.RecoverySource.recover(RecoverySource.java:122)
    at org.elasticsearch.indices.recovery.RecoverySource.access$1600(RecoverySource.java:62)
    at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:351)
    at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:337)
    at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:270)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
Caused by: org.elasticsearch.transport.ReceiveTimeoutTransportException: [elasticsearch-ip-10-0-0-41][inet[/10.0.0.41:9300]][index/shard/recovery/prepareTranslog] request_id [13988539] timed out after [900000ms]
    at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:369)
    ... 3 more
[2014-07-30 15:57:42,535][WARN ][cluster.action.shard     ] [elasticsearch-ip-10-0-0-41] [derbysoft-20140730][1] sending failed shard for [derbysoft-20140730][1], node[W-7FsjjZTyOXZdaJhhqxEA], [R], s[INITIALIZING], indexUUID [QC5Sg0FDSnOGUiFg30qNxA], reason [Failed to start shard, message [RecoveryFailedException[[derbysoft-20140730][1]: Recovery failed from [elasticsearch-ip-10-0-0-45][AjN-6_DHQK6B8NJgfphMvA][ip-10-0-0-45.us-west-2.compute.internal][inet[/10.0.0.45:9300]] into [elasticsearch-ip-10-0-0-41][W-7FsjjZTyOXZdaJhhqxEA][ip-10-0-0-41.us-west-2.compute.internal][inet[ip-10-0-0-41.us-west-2.compute.internal/10.0.0.41:9300]]]; nested: RemoteTransportException[[elasticsearch-ip-10-0-0-45][inet[/10.0.0.45:9300]][index/shard/recovery/startRecovery]]; nested: RecoveryEngineException[[derbysoft-20140730][1] Phase[2] Execution failed]; nested: ReceiveTimeoutTransportException[[elasticsearch-ip-10-0-0-41][inet[/10.0.0.41:9300]][index/shard/recovery/prepareTranslog] request_id [13988539] timed out after [900000ms]]; ]]
=====

I'm a bit at a loss as to what to try next to address this problem. Can anyone offer a suggestion?
Thanks for reading this.

Chris
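For anyone chasing a similar OOME: heap occupancy and GC behavior can be sampled over time from the nodes stats API, which shows whether old-gen usage is creeping toward the ~13GB plateau described above before a big merge tips it over. A minimal sketch, assuming a node listening on localhost:9200:

curl 'http://localhost:9200/_nodes/stats/jvm?pretty'

Comparing the reported heap_used against the max_merged_segment headroom before and during merges makes Jörg's rule of thumb easy to check on a live cluster.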