Hello,

We are running an Elasticsearch cluster with 13 nodes, 10 data and 3 master,
on Amazon hi1.4xlarge machines. The cluster contains almost 10T of data
(including one replica). It is running Elasticsearch 1.1.1 on Oracle Java
1.7.0_25.

Our problem is that every now and then the CPU load suddenly increases on 
one of the data nodes. The load average can suddenly jump from about 4 up 
to 10-16, and once it has jumped up it stays there. Then after a couple of 
days another node is also affected and so on. Eventually most nodes in the 
cluster are affected and we have to restart them. A restart of the Java 
process brings the load back to normal.

We are not experiencing any abnormal levels of garbage collection on the 
affected nodes.

I took a Java stack dump on one of the affected nodes, and one thing that 
stood out was that it had a number of threads in state IN_JAVA, whereas the 
non-loaded nodes had no such threads. The stack dump for these threads 
invariably looks something like this:

Thread 23022: (state = IN_JAVA)
 - java.util.HashMap.getEntry(java.lang.Object) @bci=72, line=446 (Compiled frame; information may be imprecise)
 - java.util.HashMap.get(java.lang.Object) @bci=11, line=405 (Compiled frame)
 - org.elasticsearch.search.scan.ScanContext$ScanFilter.getDocIdSet(org.apache.lucene.index.AtomicReaderContext, org.apache.lucene.util.Bits) @bci=8, line=156 (Compiled frame)
 - org.elasticsearch.common.lucene.search.ApplyAcceptedDocsFilter.getDocIdSet(org.apache.lucene.index.AtomicReaderContext, org.apache.lucene.util.Bits) @bci=6, line=45 (Compiled frame)
 - org.apache.lucene.search.FilteredQuery$1.scorer(org.apache.lucene.index.AtomicReaderContext, boolean, boolean, org.apache.lucene.util.Bits) @bci=34, line=130 (Compiled frame)
 - org.apache.lucene.search.IndexSearcher.search(java.util.List, org.apache.lucene.search.Weight, org.apache.lucene.search.Collector) @bci=68, line=618 (Compiled frame)
 - org.elasticsearch.search.internal.ContextIndexSearcher.search(java.util.List, org.apache.lucene.search.Weight, org.apache.lucene.search.Collector) @bci=225, line=173 (Compiled frame)
 - org.apache.lucene.search.IndexSearcher.search(org.apache.lucene.search.Query, org.apache.lucene.search.Collector) @bci=11, line=309 (Interpreted frame)
 - org.elasticsearch.search.scan.ScanContext.execute(org.elasticsearch.search.internal.SearchContext) @bci=54, line=52 (Interpreted frame)
 - org.elasticsearch.search.query.QueryPhase.execute(org.elasticsearch.search.internal.SearchContext) @bci=174, line=119 (Compiled frame)
 - org.elasticsearch.search.SearchService.executeScan(org.elasticsearch.search.internal.InternalScrollSearchRequest) @bci=49, line=233 (Interpreted frame)
 - org.elasticsearch.search.action.SearchServiceTransportAction$SearchScanScrollTransportHandler.messageReceived(org.elasticsearch.search.internal.InternalScrollSearchRequest, org.elasticsearch.transport.TransportChannel) @bci=8, line=791 (Interpreted frame)
 - org.elasticsearch.search.action.SearchServiceTransportAction$SearchScanScrollTransportHandler.messageReceived(org.elasticsearch.transport.TransportRequest, org.elasticsearch.transport.TransportChannel) @bci=6, line=780 (Interpreted frame)
 - org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run() @bci=12, line=270 (Compiled frame)
 - java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker) @bci=95, line=1145 (Compiled frame)
 - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615 (Interpreted frame)
 - java.lang.Thread.run() @bci=11, line=724 (Interpreted frame)

Does anybody know what we are experiencing, or have any tips on how to 
further debug this?
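
One way to narrow this down could be to take several stack dumps a few
seconds apart and check whether the same threads stay in IN_JAVA with the
same top frame. Below is a minimal sketch of such a comparison, assuming
each dump is available as text in the "Thread N: (state = ...)" format shown
above. The function names parse_dump and stuck_threads are hypothetical
helpers made up for illustration; they are not part of any JDK or
Elasticsearch tooling.

```python
import re

# Header line of one thread in the dump, e.g. "Thread 23022: (state = IN_JAVA)"
THREAD_RE = re.compile(r"^Thread (\d+): \(state = (\w+)\)")


def parse_dump(text):
    """Return {thread_id: (state, top_frame)} for one dump.

    top_frame is the method name of the first frame listed under the
    thread header, or None if no frame was found.
    """
    threads = {}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        match = THREAD_RE.match(stripped)
        if match:
            current = match.group(1)
            threads[current] = (match.group(2), None)
            continue
        if current and stripped.startswith("- ") and threads[current][1] is None:
            # Keep only the method name, dropping arguments and bci/line info.
            frame = stripped[2:].split("(", 1)[0].strip()
            threads[current] = (threads[current][0], frame)
    return threads


def stuck_threads(dumps, state="IN_JAVA"):
    """Return {thread_id: top_frame} for threads that are in `state`
    with an identical top frame in every one of the given dumps."""
    parsed = [parse_dump(d) for d in dumps]
    if not parsed:
        return {}
    common = set(parsed[0])
    for p in parsed[1:]:
        common &= set(p)
    result = {}
    for tid in common:
        entries = {p[tid] for p in parsed}
        if len(entries) == 1:
            thread_state, frame = next(iter(entries))
            if thread_state == state and frame:
                result[tid] = frame
    return result
```

Calling stuck_threads on a list of dump texts taken from the same process
returns the thread IDs that never left the given state, together with their
top frame. A thread that shows up in every dump with
java.util.HashMap.getEntry on top would suggest it is spinning there rather
than just passing through.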

    /MaF
