[ https://issues.apache.org/jira/browse/ACCUMULO-3975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14716781#comment-14716781 ]
Dave Marion commented on ACCUMULO-3975:
---------------------------------------

Is this really a bug? I think we have always told people not to do scans from an iterator. Subsequent comments suggest creating a design document for a new feature. Suggest closing this and opening a new ticket to work on the design.

> Deadlock by recursive scans
> ---------------------------
>
>                 Key: ACCUMULO-3975
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-3975
>             Project: Accumulo
>          Issue Type: Bug
>          Components: mini, tserver
>    Affects Versions: 1.7.0
>            Reporter: Dylan Hutchison
>             Fix For: 1.8.0
>
> A tablet server has a fixed-size thread pool that it uses for scanning. The maximum number of threads is controlled by {{tserver.readahead.concurrent.max}}, which defaults to 16.
> Take the use case of opening a Scanner inside of a server-side iterator. The following results in deadlock.
> 1. A client creates a BatchScanner (call this A) with enough query threads (say, 16) that it uses up all the readahead threads on a single tablet server.
> 2. Inside the scan on that unlucky tablet server, an iterator opens a Scanner (call these B) to tablets on the same tablet server.
> 3. The Scanner Bs inside the iterators block because there is no free readahead thread on the target tablet server to serve the request. They never unblock. Essentially the tserver scan threads block on trying to obtain tserver scan threads from the same thread pool.
> The tablet server does not seem to recover from this event even after the client disconnects (e.g. by killing the client). Not all the internalRead threads appear to die by IOException, which can prevent subsequent scans with smaller numbers of tablets from succeeding. It does recover on restarting the tablet server.
> The tablet server has some mechanism to increase the thread pool size at {{rpc.TServerUtils.createSelfResizingThreadPool}}. It seems to be ineffective.
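The three numbered steps above can be reproduced with a plain JDK thread pool, with no Accumulo involved. The following is a minimal analogue (illustrative; class and method names are made up, not Accumulo code): tasks on a fixed-size pool submit sub-tasks to the same pool and block waiting on them, exactly as the outer scans block waiting on the inner Scanners.

```java
import java.util.concurrent.*;

public class SamePoolDeadlock {
    /**
     * Minimal JDK-only analogue of the readahead-pool deadlock: tasks on a
     * fixed-size pool submit sub-tasks to the SAME pool and block waiting on
     * them. Returns true if the pool deadlocked within the timeout.
     */
    static boolean deadlocks(int poolSize, int outerTasks) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        CyclicBarrier allOuterRunning = new CyclicBarrier(outerTasks);
        CountDownLatch done = new CountDownLatch(outerTasks);
        for (int i = 0; i < outerTasks; i++) {
            pool.submit(() -> {
                try {
                    allOuterRunning.await();      // every outer task holds a pool thread
                    // Like an iterator opening a Scanner: the inner task needs a
                    // thread from the very pool the outer task is occupying.
                    pool.submit(() -> {}).get();  // blocks forever if no thread is free
                    done.countDown();
                } catch (Exception ignored) {
                }
            });
        }
        boolean finished = done.await(2, TimeUnit.SECONDS);
        pool.shutdownNow();
        return !finished;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(deadlocks(16, 8));   // false: 8 threads remain for inner tasks
        System.out.println(deadlocks(16, 16));  // true: all 16 threads are held by outer tasks
    }
}
```

As with the report, the deadlock only appears once the outer tasks saturate the pool; below that threshold the inner tasks always find a free thread.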
> I see log messages like these:
> {noformat}
> 2015-08-26 21:35:24,247 [rpc.TServerUtils] INFO : Increasing server thread pool size on TabletServer to 33
> 2015-08-26 21:35:25,248 [rpc.TServerUtils] INFO : Increasing server thread pool size on TabletServer to 33
> 2015-08-26 21:35:26,250 [rpc.TServerUtils] INFO : Increasing server thread pool size on TabletServer to 33
> 2015-08-26 21:35:27,252 [rpc.TServerUtils] INFO : Increasing server thread pool size on TabletServer to 33
> {noformat}
> Also a bunch of these pop up, in case it helps:
> {noformat}
> 2015-08-26 21:38:29,417 [tserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:40168 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1]
> 2015-08-26 21:38:34,428 [tserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:40168 !0 0 entries in 0.00 secs, nbTimes = [0 0 0.00 1]
> 2015-08-26 21:38:39,433 [tserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:40168 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1]
> 2015-08-26 21:38:44,266 [tserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:38802 !0 0 entries in 0.00 secs, nbTimes = [2 2 2.00 1]
> 2015-08-26 21:38:44,438 [tserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:40168 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1]
> 2015-08-26 21:38:48,022 [tserver.TabletServer] DEBUG: MultiScanSess 127.0.0.1:38802 0 entries in 0.02 secs (lookup_time:0.02 secs tablets:1 ranges:1)
> 2015-08-26 21:38:48,034 [tserver.TabletServer] DEBUG: MultiScanSess 127.0.0.1:38802 0 entries in 0.01 secs (lookup_time:0.01 secs tablets:1 ranges:1)
> 2015-08-26 21:38:49,452 [tserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:40168 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1]
> 2015-08-26 21:38:54,456 [tserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:40168 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1]
> 2015-08-26 21:38:59,473 [tserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:40168 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1]
> 2015-08-26 21:39:04,484 [tserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:40168 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1]
> {noformat}
> I pushed a [test case that reproduces the deadlock in the Graphulo test code|https://github.com/Accla/graphulo/blob/master/src/test/java/edu/mit/ll/graphulo/AccumuloBugTest.java#L47]. It shows that when we use fewer threads than {{tserver.readahead.concurrent.max}} (16), everything is okay, but if we use more threads then deadlock occurs pretty reliably.
> We can imagine a few kinds of solutions, such as fixing the self-increasing thread pool mechanism that does not appear to work, or making the thread pools re-entrant. Let's find a simple solution. If I had my druthers, I would create a mechanism for an Accumulo iterator to read from other tables in the same instance without having to open up a Scanner, which is an improvement beyond the scope of this ticket.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
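One of the solutions imagined in the description is making the thread pools re-entrant. A minimal JDK-only sketch of that idea (an assumption about the approach, not Accumulo's implementation; the class name and wrapper are made up): detect submissions that come from one of the pool's own worker threads and run them inline on the caller, so a saturated pool can never block waiting on itself.

```java
import java.util.concurrent.*;

/**
 * Sketch of a re-entrant thread pool: a submission arriving from one of the
 * pool's own worker threads is run inline on the caller instead of being
 * queued, guaranteeing progress even when every worker thread is busy.
 */
public class ReentrantPool {
    private final ThreadPoolExecutor pool;
    private final ThreadLocal<Boolean> isWorker = ThreadLocal.withInitial(() -> false);

    public ReentrantPool(int size) {
        // Mark each worker thread via a wrapping ThreadFactory so re-entrant
        // submissions can be detected with a ThreadLocal.
        pool = new ThreadPoolExecutor(size, size, 0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<>(),
                r -> new Thread(() -> { isWorker.set(true); r.run(); }));
    }

    public <T> Future<T> submit(Callable<T> task) {
        if (isWorker.get()) {
            // Re-entrant submission from inside the pool: run inline rather
            // than waiting for a worker thread that may never free up.
            FutureTask<T> inline = new FutureTask<>(task);
            inline.run();
            return inline;
        }
        return pool.submit(task);
    }

    public void shutdown() {
        pool.shutdownNow();
    }

    public static void main(String[] args) throws Exception {
        ReentrantPool p = new ReentrantPool(1);  // one thread: saturated instantly
        // The outer task holds the only thread; its inner submit runs inline,
        // so this completes instead of deadlocking.
        Future<Integer> f = p.submit(() -> p.submit(() -> 41).get() + 1);
        System.out.println(f.get());  // 42
        p.shutdown();
    }
}
```

The trade-off is that inline execution borrows the caller's thread, so a deeply recursive scan would still be bounded by stack depth rather than by pool size; but it turns the hard deadlock described above into ordinary serial execution.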