I see the same problem. We are running 1.1.1 on a 13-node cluster (3 master and 5+5 data). I see stuck threads on most of the data nodes, so I had a look around on one of them. Top in thread mode shows:

    top - 08:08:20 up 62 days, 18:49, 1 user, load average: 9.18, 13.21, 12.67
    Threads: 528 total, 14 running, 514 sleeping, 0 stopped, 0 zombie
    %Cpu(s): 39.0 us, 1.5 sy, 0.0 ni, 59.0 id, 0.2 wa, 0.2 hi, 0.0 si, 0.1 st
    KiB Mem:  62227892 total, 61933428 used,   294464 free,    65808 buffers
    KiB Swap: 61865980 total,    19384 used, 61846596 free. 24645668 cached Mem
      PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
     3743 elastic+  20   0  1.151t 0.045t 0.013t S  93.4 78.1  17462:00 java
     3748 elastic+  20   0  1.151t 0.045t 0.013t S  93.4 78.1  17457:55 java
     3761 elastic+  20   0  1.151t 0.045t 0.013t S  93.1 78.1  17455:21 java
     3744 elastic+  20   0  1.151t 0.045t 0.013t S  92.7 78.1  17456:55 java
     1758 elastic+  20   0  1.151t 0.045t 0.013t R   5.9 78.1   3450:01 java
     1755 elastic+  20   0  1.151t 0.045t 0.013t R   5.6 78.1   3450:05 java

So I have four threads consuming way more CPU than the others. The node is only doing a moderate amount of garbage collection. Running jstack, I find that all the stuck threads have a stack dump that looks like this:

    Thread 3744: (state = IN_JAVA)
     - java.util.HashMap.getEntry(java.lang.Object) @bci=72, line=446 (Compiled frame; information may be imprecise)
     - java.util.HashMap.get(java.lang.Object) @bci=11, line=405 (Compiled frame)
     - org.elasticsearch.search.scan.ScanContext$ScanFilter.getDocIdSet(org.apache.lucene.index.AtomicReaderContext, org.apache.lucene.util.Bits) @bci=8, line=156 (Compiled frame)
     - org.elasticsearch.common.lucene.search.ApplyAcceptedDocsFilter.getDocIdSet(org.apache.lucene.index.AtomicReaderContext, org.apache.lucene.util.Bits) @bci=6, line=45 (Compiled frame)
     - org.apache.lucene.search.FilteredQuery$1.scorer(org.apache.lucene.index.AtomicReaderContext, boolean, boolean, org.apache.lucene.util.Bits) @bci=34, line=130 (Compiled frame)
     - org.apache.lucene.search.IndexSearcher.search(java.util.List, org.apache.lucene.search.Weight, org.apache.lucene.search.Collector) @bci=68, line=618 (Compiled frame)
     - org.elasticsearch.search.internal.ContextIndexSearcher.search(java.util.List, org.apache.lucene.search.Weight, org.apache.lucene.search.Collector) @bci=225, line=173 (Compiled frame)
     - org.apache.lucene.search.IndexSearcher.search(org.apache.lucene.search.Query, org.apache.lucene.search.Collector) @bci=11, line=309 (Interpreted frame)
     - org.elasticsearch.search.scan.ScanContext.execute(org.elasticsearch.search.internal.SearchContext) @bci=54, line=52 (Interpreted frame)
     - org.elasticsearch.search.query.QueryPhase.execute(org.elasticsearch.search.internal.SearchContext) @bci=174, line=119 (Compiled frame)
     - org.elasticsearch.search.SearchService.executeScan(org.elasticsearch.search.internal.InternalScrollSearchRequest) @bci=49, line=233 (Interpreted frame)
     - org.elasticsearch.search.action.SearchServiceTransportAction$SearchScanScrollTransportHandler.messageReceived(org.elasticsearch.search.internal.InternalScrollSearchRequest, org.elasticsearch.transport.TransportChannel) @bci=8, line=791 (Interpreted frame)
     - org.elasticsearch.search.action.SearchServiceTransportAction$SearchScanScrollTransportHandler.messageReceived(org.elasticsearch.transport.TransportRequest, org.elasticsearch.transport.TransportChannel) @bci=6, line=780 (Interpreted frame)
     - org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run() @bci=12, line=270 (Compiled frame)
     - java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker) @bci=95, line=1145 (Compiled frame)
     - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615 (Interpreted frame)
     - java.lang.Thread.run() @bci=11, line=724 (Interpreted frame)

The state varies between IN_JAVA and BLOCKED. I took two stack traces 10 minutes apart and they were identical for the suspect threads. I assume this could be a very long-running query, but I wonder if it isn't just stuck.
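As an aside (not part of the original diagnosis): when correlating the hot thread PIDs from top -H with a regular, non-forced jstack dump, the threads are identified by a hex "nid=0x..." field, so the decimal PID has to be converted to hex first. The forced dump above already prints decimal thread ids, so this trivial sketch is only needed for the regular format:

    // Hypothetical helper: convert a decimal thread PID from `top -H`
    // into the hex nid printed by a regular `jstack <pid>` dump,
    // e.g. 3744 -> nid=0xea0.
    public class TidToNid {
        public static void main(String[] args) {
            int tid = Integer.parseInt(args[0]);   // e.g. 3744 from top
            System.out.println("nid=0x" + Integer.toHexString(tid));
        }
    }

For example, "java TidToNid 3744" prints "nid=0xea0", which can then be grepped for in the jstack output.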
Perhaps we are seeing this issue: http://stackoverflow.com/questions/17070184/hashmap-stuck-on-get
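That StackOverflow post describes the classic failure mode of a plain java.util.HashMap mutated by several threads without synchronization: on JDK 7 (where the line numbers in the trace above point), a racing resize can leave a cycle in a bucket's entry chain, after which get() spins forever inside getEntry() and pins a core, which would match both the IN_JAVA state and the identical stack traces taken 10 minutes apart. Below is a minimal, nondeterministic sketch of that generic hazard, not Elasticsearch code; the usual remedies are ConcurrentHashMap or external synchronization around the map.

    import java.util.HashMap;
    import java.util.Map;

    public class HashMapRaceDemo {
        // Plain HashMap, deliberately shared across threads with no synchronization.
        static final Map<Integer, Integer> MAP = new HashMap<Integer, Integer>();

        public static void main(String[] args) throws Exception {
            // Several writers grow (and therefore resize) the map concurrently.
            for (int t = 0; t < 4; t++) {
                Thread writer = new Thread(new Runnable() {
                    public void run() {
                        for (int i = 0; i < 1000000; i++) {
                            MAP.put(i, i);
                        }
                    }
                });
                writer.setDaemon(true);
                writer.start();
            }
            Thread.sleep(2000);
            // On JDK 7, if a racing resize has corrupted a bucket chain into a
            // cycle, this get() never returns and the thread burns ~100% CPU
            // inside HashMap.getEntry, the same frame the stuck scan threads
            // are parked in. The failure is nondeterministic and may take
            // several runs to trigger.
            System.out.println(MAP.get(999999));
        }
    }

If that is what is happening here, the spinning loop never checks for interruption, so the affected threads would not recover on their own and a node restart would be needed to free them.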