Hello, We are running an es-cluster with 13 nodes, 10 data and 3 master, on Amazon hi1.4xlarge machines. The cluster contains almost 10T of data (including one replica). It is running Elasticsearch 1.1.1 on Oracle java 1.7.0_25.
Our problem is that every now and then the cpu load suddenly increases on one of the data nodes. The load average can suddenly jump from about 4 up to 10-16, and once it has jumped up it stays there. Then after a couple of days another node is also affected and so on. Eventually most nodes in the cluster are affected and we have to restart them. A restart of the Java process brings the load back to normal. We are not experiencing any abnormal levels of garbage collection on the affected nodes. I did a java stack dump on one of the affected node and one things which stood out was that it had a nubber of threads with state IN_JAVA, the non-loaded nodes had no such threads. The stack-dump for these threads ivariably looks something lie this: Thread 23022: (state = IN_JAVA) - java.util.HashMap.getEntry(java.lang.Object) @bci=72, line=446 (Compiled frame; information may be imprecise) - java.util.HashMap.get(java.lang.Object) @bci=11, line=405 (Compiled frame) - org.elasticsearch.search.scan.ScanContext$ScanFilter.getDocIdSet(org.apache.lucene.index.AtomicReaderContext, org.apache.lucene.util.Bits) @bci=8, line=156 (Compiled frame) - org.elasticsearch.common.lucene.search.ApplyAcceptedDocsFilter.getDocIdSet(org.apache.lucene.index.AtomicReaderContext, org.apache.lucene.util.Bits) @bci=6, line=45 (Compiled frame) - org.apache.lucene.search.FilteredQuery$1.scorer(org.apache.lucene.index.AtomicReaderContext, boolean, boolean, org.apache.lucene.util.Bits) @bci=34, line=130 (Compiled frame) - org.apache.lucene.search.IndexSearcher.search(java.util.List, org.apache.lucene.search.Weight, org.apache.lucene.search.Collector) @bci=68, line=618 (Compiled frame) - org.elasticsearch.search.internal.ContextIndexSearcher.search(java.util.List, org.apache.lucene.search.Weight, org.apache.lucene.search.Collector) @bci=225, line=173 (Compiled frame) - org.apache.lucene.search.IndexSearcher.search(org.apache.lucene.search.Query, org.apache.lucene.search.Collector) @bci=11, line=309 (Interpreted frame) - org.elasticsearch.search.scan.ScanContext.execute(org.elasticsearch.search.internal.SearchContext) @bci=54, line=52 (Interpreted frame) - org.elasticsearch.search.query.QueryPhase.execute(org.elasticsearch.search.internal.SearchContext) @bci=174, line=119 (Compiled frame) - org.elasticsearch.search.SearchService.executeScan(org.elasticsearch.search.internal.InternalScrollSearchRequest) @bci=49, line=233 (Interpreted frame) - org.elasticsearch.search.action.SearchServiceTransportAction$SearchScanScrollTransportHandler.messageReceived(org.elasticsearch.search.internal.InternalScrollSearchRequest, org.elasticsearch.transport.TransportChannel) @bci=8, line=791 (Interpreted frame) - org.elasticsearch.search.action.SearchServiceTransportAction$SearchScanScrollTransportHandler.messageReceived(org.elasticsearch.transport.TransportRequest, org.elasticsearch.transport.TransportChannel) @bci=6, line=780 (Interpreted frame) - org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run() @bci=12, line=270 (Compiled frame) - java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker) @bci=95, line=1145 (Compiled frame) - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615 (Interpreted frame) - java.lang.Thread.run() @bci=11, line=724 (Interpreted frame) Does anybody know what we are experiencing, or have any tips on how to further debug this? /MaF -- You received this message because you are subscribed to the Google Groups "elasticsearch" group. To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscr...@googlegroups.com. To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/e83a7e9f-6fe4-4d45-b19c-95f8d8418659%40googlegroups.com. For more options, visit https://groups.google.com/d/optout.