[ https://issues.apache.org/jira/browse/ACCUMULO-3975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14716781#comment-14716781 ]
Dave Marion commented on ACCUMULO-3975:
---------------------------------------

Is this really a bug? I think we have always told people not to do scans from an iterator. Subsequent comments suggest creating a design document for a new feature. Suggest closing this and opening a new ticket to work on the design.

> Deadlock by recursive scans
> ---------------------------
>
>                 Key: ACCUMULO-3975
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-3975
>             Project: Accumulo
>          Issue Type: Bug
>          Components: mini, tserver
>    Affects Versions: 1.7.0
>            Reporter: Dylan Hutchison
>             Fix For: 1.8.0
>
> A tablet server has a fixed-size thread pool that it uses for scanning. The maximum number of threads is controlled by {{tserver.readahead.concurrent.max}}, which defaults to 16.
> Take the use case of opening a Scanner inside of a server-side iterator. The following results in deadlock.
> 1. A client creates a BatchScanner (call this A) with enough query threads (say, 16) that it uses up all the readahead threads on a single tablet server.
> 2. Inside the scan on that unlucky tablet server, an iterator opens a Scanner (call these B) to tablets on the same tablet server.
> 3. The Scanner Bs inside the iterators block because there is no free readahead thread on the target tablet server to serve the request. They never unblock. Essentially the tserver scan threads block on trying to obtain tserver scan threads from the same thread pool.
> The tablet server does not seem to recover from this event even after the client disconnects (e.g. by killing the client). Not all the internalRead threads appear to die by IOException, which can prevent subsequent scans with smaller numbers of tablets from succeeding. It does recover on restarting the tablet server.
> The tablet server has some mechanism to increase the thread pool size at {{rpc.TServerUtils.createSelfResizingThreadPool}}. It seems to be ineffective.
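The three numbered steps above can be reproduced with a plain JDK thread pool, with no Accumulo involved. The following is a minimal analogue (illustrative; class and method names are made up, not Accumulo code): tasks on a fixed-size pool submit sub-tasks to the same pool and block waiting on them, exactly as the outer scans block waiting on the inner Scanners.

```java
import java.util.concurrent.*;

public class SamePoolDeadlock {
    /**
     * Minimal JDK-only analogue of the readahead-pool deadlock: tasks on a
     * fixed-size pool submit sub-tasks to the SAME pool and block waiting on
     * them. Returns true if the pool deadlocked within the timeout.
     */
    static boolean deadlocks(int poolSize, int outerTasks) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        CyclicBarrier allOuterRunning = new CyclicBarrier(outerTasks);
        CountDownLatch done = new CountDownLatch(outerTasks);
        for (int i = 0; i < outerTasks; i++) {
            pool.submit(() -> {
                try {
                    allOuterRunning.await();      // every outer task holds a pool thread
                    // Like an iterator opening a Scanner: the inner task needs a
                    // thread from the very pool the outer task is occupying.
                    pool.submit(() -> {}).get();  // blocks forever if no thread is free
                    done.countDown();
                } catch (Exception ignored) {
                }
            });
        }
        boolean finished = done.await(2, TimeUnit.SECONDS);
        pool.shutdownNow();
        return !finished;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(deadlocks(16, 8));   // false: 8 threads remain for inner tasks
        System.out.println(deadlocks(16, 16));  // true: all 16 threads are held by outer tasks
    }
}
```

As with the report, the deadlock only appears once the outer tasks saturate the pool; below that threshold the inner tasks always find a free thread.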
> I see log messages like these:
> {noformat}
> 2015-08-26 21:35:24,247 [rpc.TServerUtils] INFO : Increasing server thread pool size on TabletServer to 33
> 2015-08-26 21:35:25,248 [rpc.TServerUtils] INFO : Increasing server thread pool size on TabletServer to 33
> 2015-08-26 21:35:26,250 [rpc.TServerUtils] INFO : Increasing server thread pool size on TabletServer to 33
> 2015-08-26 21:35:27,252 [rpc.TServerUtils] INFO : Increasing server thread pool size on TabletServer to 33
> {noformat}
> Also a bunch of these pop up, in case it helps:
> {noformat}
> 2015-08-26 21:38:29,417 [tserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:40168 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1]
> 2015-08-26 21:38:34,428 [tserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:40168 !0 0 entries in 0.00 secs, nbTimes = [0 0 0.00 1]
> 2015-08-26 21:38:39,433 [tserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:40168 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1]
> 2015-08-26 21:38:44,266 [tserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:38802 !0 0 entries in 0.00 secs, nbTimes = [2 2 2.00 1]
> 2015-08-26 21:38:44,438 [tserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:40168 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1]
> 2015-08-26 21:38:48,022 [tserver.TabletServer] DEBUG: MultiScanSess 127.0.0.1:38802 0 entries in 0.02 secs (lookup_time:0.02 secs tablets:1 ranges:1)
> 2015-08-26 21:38:48,034 [tserver.TabletServer] DEBUG: MultiScanSess 127.0.0.1:38802 0 entries in 0.01 secs (lookup_time:0.01 secs tablets:1 ranges:1)
> 2015-08-26 21:38:49,452 [tserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:40168 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1]
> 2015-08-26 21:38:54,456 [tserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:40168 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1]
> 2015-08-26 21:38:59,473 [tserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:40168 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1]
> 2015-08-26 21:39:04,484 [tserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:40168 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1]
> {noformat}
> I pushed a [test case that reproduces the deadlock in the Graphulo test code|https://github.com/Accla/graphulo/blob/master/src/test/java/edu/mit/ll/graphulo/AccumuloBugTest.java#L47]. It shows that when we use fewer threads than {{tserver.readahead.concurrent.max}} (16), everything is okay, but if we use more threads then deadlock occurs pretty reliably.
> We can imagine a few kinds of solutions, such as fixing the self-increasing thread pool mechanism that does not appear to work, or making the thread pools re-entrant. Let's find a simple solution. If I had my druthers, I would create a mechanism for an Accumulo iterator to read from other tables in the same instance without having to open up a Scanner, which is an improvement beyond the scope of this ticket.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
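One of the solutions imagined in the description is making the thread pools re-entrant. A minimal JDK-only sketch of that idea (an assumption about the approach, not Accumulo's implementation; the class name and wrapper are made up): detect submissions that come from one of the pool's own worker threads and run them inline on the caller, so a saturated pool can never block waiting on itself.

```java
import java.util.concurrent.*;

/**
 * Sketch of a re-entrant thread pool: a submission arriving from one of the
 * pool's own worker threads is run inline on the caller instead of being
 * queued, guaranteeing progress even when every worker thread is busy.
 */
public class ReentrantPool {
    private final ThreadPoolExecutor pool;
    private final ThreadLocal<Boolean> isWorker = ThreadLocal.withInitial(() -> false);

    public ReentrantPool(int size) {
        // Mark each worker thread via a wrapping ThreadFactory so re-entrant
        // submissions can be detected with a ThreadLocal.
        pool = new ThreadPoolExecutor(size, size, 0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<>(),
                r -> new Thread(() -> { isWorker.set(true); r.run(); }));
    }

    public <T> Future<T> submit(Callable<T> task) {
        if (isWorker.get()) {
            // Re-entrant submission from inside the pool: run inline rather
            // than waiting for a worker thread that may never free up.
            FutureTask<T> inline = new FutureTask<>(task);
            inline.run();
            return inline;
        }
        return pool.submit(task);
    }

    public void shutdown() {
        pool.shutdownNow();
    }

    public static void main(String[] args) throws Exception {
        ReentrantPool p = new ReentrantPool(1);  // one thread: saturated instantly
        // The outer task holds the only thread; its inner submit runs inline,
        // so this completes instead of deadlocking.
        Future<Integer> f = p.submit(() -> p.submit(() -> 41).get() + 1);
        System.out.println(f.get());  // 42
        p.shutdown();
    }
}
```

The trade-off is that inline execution borrows the caller's thread, so a deeply recursive scan would still be bounded by stack depth rather than by pool size; but it turns the hard deadlock described above into ordinary serial execution.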