Re: ES OOMing and not triggering cache circuit breakers, using LocalManualCache
After some experimentation, I believe _cluster/stats shows the total field data across the whole cluster: I managed to push my test cluster to 198MiB of field data cache usage. As a result, and based on Zachary's feedback, I've set the following values in my elasticsearch.yml:

  indices.fielddata.cache.size: 15gb
  indices.fielddata.cache.expire: 7d
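Assuming the standard 1.x node-stats endpoint (this command is a sketch added for illustration, not part of the original thread), the per-node picture is easier to read than _cluster/stats, and it also exposes evictions, which should start climbing once a node hits the 15gb cap:

  # Field data usage and evictions, reported per node and per field.
  curl 'http://localhost:9200/_nodes/stats/indices/fielddata?human&pretty&fields=*'

Non-zero evictions after the change would confirm the cache is being capped rather than growing until the heap fills.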
Re: ES OOMing and not triggering cache circuit breakers, using LocalManualCache
Oh, is field data per-node or total across the cluster? I grabbed a test cluster with two data nodes, and I deliberately set fielddata really low:

  indices.fielddata.cache.size: 100mb

However, after a few queries, I'm seeing more than 100MiB in use:

  $ curl 'http://localhost:9200/_cluster/stats?human&pretty'
  ...
  fielddata: {
    memory_size: 119.7mb,
    memory_size_in_bytes: 125543995,
    evictions: 0
  },

Is this expected?
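For what it's worth, indices.fielddata.cache.size is applied per node, while _cluster/stats aggregates field data across all nodes, so a two-node cluster with a 100mb limit can legitimately report close to 200mb cluster-wide (which matches the 198MiB seen in the follow-up above). A per-node view makes this clearer; the command below is a sketch against the 1.x cat API, not from the original post:

  # One row per node, so each node's usage can be compared against
  # its own indices.fielddata.cache.size limit.
  curl 'http://localhost:9200/_cat/fielddata?v'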
ES OOMing and not triggering cache circuit breakers, using LocalManualCache
Hi all

I have an ES 1.2.4 cluster which is occasionally running out of heap. I have ES_HEAP_SIZE=31G and, according to the heap dump generated, my biggest memory users were:

  org.elasticsearch.common.cache.LocalCache$LocalManualCache  55%
  org.elasticsearch.indices.cache.filter.IndicesFilterCache   11%

and nothing else used more than 1%.

It's not clear to me what this cache is. I can't find any references to ManualCache in the elasticsearch source code, and the docs (http://www.elasticsearch.org/guide/en/elasticsearch/reference/1.x/index-modules-fielddata.html) suggest to me that the circuit breakers should stop requests or reduce cache usage rather than OOMing.

At the moment my cache filled up, the node was actually trying to index some data:

  [2015-02-11 08:14:29,775][WARN ][index.translog ] [data-node-2] [logstash-2015.02.11][0] failed to flush shard on translog threshold
  org.elasticsearch.index.engine.FlushFailedEngineException: [logstash-2015.02.11][0] Flush failed
      at org.elasticsearch.index.engine.internal.InternalEngine.flush(InternalEngine.java:805)
      at org.elasticsearch.index.shard.service.InternalIndexShard.flush(InternalIndexShard.java:604)
      at org.elasticsearch.index.translog.TranslogService$TranslogBasedFlush$1.run(TranslogService.java:202)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      at java.lang.Thread.run(Thread.java:745)
  Caused by: java.lang.IllegalStateException: this writer hit an OutOfMemoryError; cannot commit
      at org.apache.lucene.index.IndexWriter.startCommit(IndexWriter.java:4416)
      at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:2989)
      at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3096)
      at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3063)
      at org.elasticsearch.index.engine.internal.InternalEngine.flush(InternalEngine.java:797)
      ... 5 more
  [2015-02-11 08:14:29,812][DEBUG][action.bulk ] [data-node-2] [logstash-2015.02.11][0] failed to execute bulk item (index) index {[logstash-2015.02.11][syslog_slurm][1
  org.elasticsearch.index.engine.CreateFailedEngineException: [logstash-2015.02.11][0] Create failed for [syslog_slurm#12UUWk5mR_2A1FGP5W3_1g]
      at org.elasticsearch.index.engine.internal.InternalEngine.create(InternalEngine.java:393)
      at org.elasticsearch.index.shard.service.InternalIndexShard.create(InternalIndexShard.java:384)
      at org.elasticsearch.action.bulk.TransportShardBulkAction.shardIndexOperation(TransportShardBulkAction.java:430)
      at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:158)
      at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction.performOnPrimary(TransportShardReplicationOperationAction
      at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1.run(TransportShardReplicationOperationAction.java:433)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      at java.lang.Thread.run(Thread.java:745)
  Caused by: java.lang.OutOfMemoryError: Java heap space
      at org.apache.lucene.util.fst.BytesStore.writeByte(BytesStore.java:83)
      at org.apache.lucene.util.fst.FST.init(FST.java:286)
      at org.apache.lucene.util.fst.Builder.init(Builder.java:163)
      at org.apache.lucene.codecs.BlockTreeTermsWriter$PendingBlock.compileIndex(BlockTreeTermsWriter.java:422)
      at org.apache.lucene.codecs.BlockTreeTermsWriter$TermsWriter.writeBlocks(BlockTreeTermsWriter.java:572)
      at org.apache.lucene.codecs.BlockTreeTermsWriter$TermsWriter$FindBlocks.freeze(BlockTreeTermsWriter.java:547)
      at org.apache.lucene.util.fst.Builder.freezeTail(Builder.java:214)
      at org.apache.lucene.util.fst.Builder.add(Builder.java:394)
      at org.apache.lucene.codecs.BlockTreeTermsWriter$TermsWriter.finishTerm(BlockTreeTermsWriter.java:1039)
      at org.apache.lucene.index.FreqProxTermsWriterPerField.flush(FreqProxTermsWriterPerField.java:548)
      at org.apache.lucene.index.FreqProxTermsWriter.flush(FreqProxTermsWriter.java:85)
      at org.apache.lucene.index.TermsHash.flush(TermsHash.java:116)
      at org.apache.lucene.index.DocInverter.flush(DocInverter.java:53)
      at org.apache.lucene.index.DocFieldProcessor.flush(DocFieldProcessor.java:81)
      at org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:465)
      at org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:518)
      at ...
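For anyone who lands here with a node already in this state: on 1.x the field data cache can be dropped at runtime without a restart, which buys time while a proper limit is put in place. This is a sketch against the indices clear-cache API; the exact query parameter has varied between versions (fielddata vs field_data), so verify it against the 1.2 reference before relying on it:

  # Emergency relief only: drop the field data cache on all indices.
  # It will be rebuilt as soon as queries touch those fields again.
  curl -XPOST 'http://localhost:9200/_cache/clear?fielddata=true'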
Re: ES OOMing and not triggering cache circuit breakers, using LocalManualCache
LocalManualCache is a component of Guava's LRU cache (https://code.google.com/p/guava-libraries/source/browse/guava-gwt/src-super/com/google/common/cache/super/com/google/common/cache/CacheBuilder.java), which is used by Elasticsearch for both the filter and field data caches. Based on your node stats, I'd agree it is the field data usage which is causing your OOMs.

The CircuitBreaker helps prevent OOM, but it works on a per-request basis. It's possible for individual requests to pass the CB because they use small subsets of fields, but over time the set of fields loaded into field data continues to grow and you'll OOM anyway.

I would prefer to set a field data limit rather than an expiration. A hard limit prevents OOM because you don't allow the cache to grow any more. An expiration does not guarantee that, since you could get a burst of activity that still fills up the heap and OOMs before the expiration can work.

-Z
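To make "limit rather than expiration" concrete, the sketch below shows what a capped setup might look like on this cluster. Both names are assumptions to verify against the 1.2.x docs: the circuit breaker setting was, to my knowledge, called indices.fielddata.breaker.limit in 1.2/1.3 (renamed indices.breaker.fielddata.limit in 1.4), and indices.fielddata.cache.size normally goes in elasticsearch.yml rather than the cluster settings API:

  # elasticsearch.yml, per node: hard cap on the field data cache.
  #   indices.fielddata.cache.size: 15gb

  # Cluster settings API: per-request fielddata circuit breaker limit
  # (setting name assumed for 1.2.x; verify against your version).
  curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
    "persistent": {
      "indices.fielddata.breaker.limit": "40%"
    }
  }'

The two act at different points: the breaker rejects a request before it loads field data that would exceed the limit, while the cache size evicts fields that are already loaded, so together they bound both the burst and the steady-state usage.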
Re: ES OOMing and not triggering cache circuit breakers, using LocalManualCache
After examining some other nodes that were using a lot of their heap, I think this is actually the field data cache:

  $ curl 'http://localhost:9200/_cluster/stats?human&pretty'
  ...
  fielddata: {
    memory_size: 21.3gb,
    memory_size_in_bytes: 22888612852,
    evictions: 0
  },
  filter_cache: {
    memory_size: 6.1gb,
    memory_size_in_bytes: 6650700423,
    evictions: 12214551
  },

Since this is storing logstash data, I'm going to add the following lines to my elasticsearch.yml and see if I observe a difference once deployed to production:

  # Don't hold field data caches for more than a day, since data is
  # grouped by day and we quickly lose interest in historical data.
  indices.fielddata.cache.expire: 1d
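One way to tell whether the expiry is actually reclaiming memory once deployed is to watch whether fielddata memory_size stops growing day over day (and whether its evictions ever move off zero; they are 0 above, while the filter cache has evicted over 12 million entries). A throwaway polling sketch, added for illustration and using the same stats endpoint as above:

  # Poll cluster-wide field data usage and evictions every 5 minutes.
  while true; do
    date
    curl -s 'http://localhost:9200/_cluster/stats?human&pretty' | grep -A 3 '"fielddata"'
    sleep 300
  done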