I see the same problem. We are running 1.1.1 on a 13-node cluster (3 master 
and 5+5 data nodes). I see stuck threads on most of the data nodes, so I had 
a look around on one of them. top in thread mode shows:
top - 08:08:20 up 62 days, 18:49,  1 user,  load average: 9.18, 13.21, 12.67
Threads: 528 total,  14 running, 514 sleeping,   0 stopped,   0 zombie
%Cpu(s): 39.0 us,  1.5 sy,  0.0 ni, 59.0 id,  0.2 wa,  0.2 hi,  0.0 si,  0.1 st
KiB Mem:  62227892 total, 61933428 used,   294464 free,    65808 buffers
KiB Swap: 61865980 total,    19384 used, 61846596 free. 24645668 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
 3743 elastic+  20   0  1.151t 0.045t 0.013t S 93.4 78.1  17462:00 java
 3748 elastic+  20   0  1.151t 0.045t 0.013t S 93.4 78.1  17457:55 java
 3761 elastic+  20   0  1.151t 0.045t 0.013t S 93.1 78.1  17455:21 java
 3744 elastic+  20   0  1.151t 0.045t 0.013t S 92.7 78.1  17456:55 java
 1758 elastic+  20   0  1.151t 0.045t 0.013t R  5.9 78.1   3450:01 java
 1755 elastic+  20   0  1.151t 0.045t 0.013t R  5.6 78.1   3450:05 java

So I have four threads consuming far more CPU than the others. The node is 
only doing a moderate amount of garbage collection. Running jstack, I find 
that all the stuck threads have a stack dump that looks like this:
Thread 3744: (state = IN_JAVA)
 - java.util.HashMap.getEntry(java.lang.Object) @bci=72, line=446 (Compiled frame; information may be imprecise)
 - java.util.HashMap.get(java.lang.Object) @bci=11, line=405 (Compiled frame)
 - org.elasticsearch.search.scan.ScanContext$ScanFilter.getDocIdSet(org.apache.lucene.index.AtomicReaderContext, org.apache.lucene.util.Bits) @bci=8, line=156 (Compiled frame)
 - org.elasticsearch.common.lucene.search.ApplyAcceptedDocsFilter.getDocIdSet(org.apache.lucene.index.AtomicReaderContext, org.apache.lucene.util.Bits) @bci=6, line=45 (Compiled frame)
 - org.apache.lucene.search.FilteredQuery$1.scorer(org.apache.lucene.index.AtomicReaderContext, boolean, boolean, org.apache.lucene.util.Bits) @bci=34, line=130 (Compiled frame)
 - org.apache.lucene.search.IndexSearcher.search(java.util.List, org.apache.lucene.search.Weight, org.apache.lucene.search.Collector) @bci=68, line=618 (Compiled frame)
 - org.elasticsearch.search.internal.ContextIndexSearcher.search(java.util.List, org.apache.lucene.search.Weight, org.apache.lucene.search.Collector) @bci=225, line=173 (Compiled frame)
 - org.apache.lucene.search.IndexSearcher.search(org.apache.lucene.search.Query, org.apache.lucene.search.Collector) @bci=11, line=309 (Interpreted frame)
 - org.elasticsearch.search.scan.ScanContext.execute(org.elasticsearch.search.internal.SearchContext) @bci=54, line=52 (Interpreted frame)
 - org.elasticsearch.search.query.QueryPhase.execute(org.elasticsearch.search.internal.SearchContext) @bci=174, line=119 (Compiled frame)
 - org.elasticsearch.search.SearchService.executeScan(org.elasticsearch.search.internal.InternalScrollSearchRequest) @bci=49, line=233 (Interpreted frame)
 - org.elasticsearch.search.action.SearchServiceTransportAction$SearchScanScrollTransportHandler.messageReceived(org.elasticsearch.search.internal.InternalScrollSearchRequest, org.elasticsearch.transport.TransportChannel) @bci=8, line=791 (Interpreted frame)
 - org.elasticsearch.search.action.SearchServiceTransportAction$SearchScanScrollTransportHandler.messageReceived(org.elasticsearch.transport.TransportRequest, org.elasticsearch.transport.TransportChannel) @bci=6, line=780 (Interpreted frame)
 - org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run() @bci=12, line=270 (Compiled frame)
 - java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker) @bci=95, line=1145 (Compiled frame)
 - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615 (Interpreted frame)
 - java.lang.Thread.run() @bci=11, line=724 (Interpreted frame)

The state varies between IN_JAVA and BLOCKED. I took two stack traces 10 
minutes apart and they were identical for the suspect threads.
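
For reference, the thread numbers in this kind of dump appear to be the 
native thread ids, so they line up directly with the per-thread PIDs from 
top (3744 is one of the four hot threads above). With a regular jstack <pid> 
dump the thread ids show up as hex nid=0x... values instead, so converting 
the top PIDs to hex makes the matching easier. A minimal sketch; the class 
name and the hard-coded PIDs are just for illustration:

public class TopToNid {
    public static void main(String[] args) {
        // The four hot thread PIDs reported by top in thread mode.
        int[] topThreadPids = {3743, 3748, 3761, 3744};
        for (int pid : topThreadPids) {
            // e.g. 3744 -> nid=0xea0 in a regular jstack dump
            System.out.printf("thread %d -> nid=0x%x%n", pid, pid);
        }
    }
}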

I suppose this could be a very long-running query, but I wonder if it isn't 
just stuck. Perhaps we are seeing this issue: 
http://stackoverflow.com/questions/17070184/hashmap-stuck-on-get
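
To illustrate what that issue describes (a generic reproduction, not 
Elasticsearch code): if a plain java.util.HashMap is written to by several 
threads without synchronization, a racing resize on Java 7 and earlier can 
leave a cycle in one of the bucket chains, and any later get() on that 
bucket walks the cycle forever. That would match what we see here: threads 
pinned at >90% CPU inside HashMap.getEntry, state IN_JAVA, with identical 
stacks ten minutes apart. Rough sketch of the failure mode:

import java.util.HashMap;
import java.util.Map;

// Generic reproduction of the "HashMap stuck on get" failure mode:
// a non-thread-safe HashMap mutated from two threads at once. On Java 7
// and earlier a racing resize can corrupt a bucket chain into a cycle,
// after which get()/getEntry() never terminates. (The writers themselves
// can also hang inside the resize.)
public class HashMapGetLoop {

    static final Map<Integer, Integer> map = new HashMap<Integer, Integer>();

    public static void main(String[] args) throws InterruptedException {
        Runnable writer = new Runnable() {
            public void run() {
                // Lots of puts with random keys to force repeated resizes.
                for (int i = 0; i < 1000000; i++) {
                    map.put((int) (Math.random() * 100000), i);
                }
            }
        };
        Thread w1 = new Thread(writer);
        Thread w2 = new Thread(writer);
        w1.start();
        w2.start();
        w1.join();
        w2.join();

        // If a bucket chain was corrupted, this lookup may spin forever.
        System.out.println(map.get(42));
        System.out.println("map survived this run");
    }
}

If that is what is happening in ScanContext, then presumably the map there 
would need to be a ConcurrentHashMap, or access to it synchronized, to be 
safe.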
