[ 
https://issues.apache.org/jira/browse/HBASE-17889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15961203#comment-15961203
 ] 

huaxiang sun commented on HBASE-17889:
--------------------------------------

Thanks @stack and [~tedyu]. getTaskFuture() is not used anywhere, so I will 
clean up the code a bit. getFuture/setFuture will be called from different 
threads (at the least, I think cancel() is called from a different thread when 
the thread pool is shut down), so making it volatile seems necessary.
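
To illustrate the visibility concern, here is a minimal sketch, not the actual HBase class. FutureHolderSketch and its demo() method are hypothetical names; only getFuture/setFuture/cancel() come from the discussion above, and their signatures here are assumed.

```java
import java.util.concurrent.Future;
import java.util.concurrent.FutureTask;

// Hypothetical holder, loosely modelled on the getFuture/setFuture pair discussed above.
class FutureHolderSketch {
    // volatile: a write made by the submitting thread is visible to a
    // later read made by whichever thread ends up calling cancel().
    private volatile Future<?> future;

    void setFuture(Future<?> f) { this.future = f; }

    Future<?> getFuture() { return future; }

    boolean cancel(boolean mayInterruptIfRunning) {
        Future<?> f = future; // read the volatile field once
        return f != null && f.cancel(mayInterruptIfRunning);
    }

    // Sets the future on one thread and cancels it from another.
    // join() already gives a happens-before edge here, so this only shows
    // the usage pattern; it does not by itself prove that volatile is required.
    static boolean demo() {
        FutureHolderSketch holder = new FutureHolderSketch();
        FutureTask<String> task = new FutureTask<>(() -> "result");
        Thread setter = new Thread(() -> holder.setFuture(task));
        setter.start();
        try {
            setter.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return holder.cancel(true) && task.isCancelled();
    }
}
```

Without volatile (or some other synchronization), the cancelling thread would not be guaranteed to see the reference written by the submitting thread.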

The test was done on the 1.2 code. A test client issues continuous GETs with 
TIMELINE consistency against a table with 2 replicas. When the region server 
hosting the primary replica is shut down with "shutdown -r now", the test 
client is stuck for about 50 seconds; the jstack dump is attached. I added 
trace logging to the code to print the QueueingFuture references submitted and 
returned. Before the client got stuck, the QueueingFutures for the secondary 
replica returned but the ones for the primary replica did not. After these 50 
seconds (when the socket write times out), the QueueingFutures for the primary 
replica returned. This confirms that the stuck threads are the ones serving the 
primary replica.

With this fix, the same test was run and the test client no longer hung. The 
trace log showed that the threads for the primary replica were interrupted and 
completed after cancel() was called on them.
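
The effect can be sketched with plain java.util.concurrent, independent of HBase. This is an illustration under assumptions, not the patch itself: CancelInterruptSketch and secondTaskRuns are hypothetical names, a one-thread pool stands in for the exhausted connection pool, and Thread.sleep stands in for the blocked write (whether a real blocking write reacts to an interrupt depends on the I/O in use).

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class CancelInterruptSketch {
    // Returns true if a second task gets to run on a one-thread pool
    // shortly after the first (blocked) task has been cancelled.
    static boolean secondTaskRuns(boolean mayInterruptIfRunning) {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        CountDownLatch started = new CountDownLatch(1);
        try {
            Future<?> stuck = pool.submit(() -> {
                started.countDown();
                try {
                    Thread.sleep(5_000); // stand-in for a blocked call
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // give up and return
                }
            });
            started.await(); // make sure the worker is really occupied
            stuck.cancel(mayInterruptIfRunning);
            Future<String> next = pool.submit(() -> "done");
            return "done".equals(next.get(500, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            return false; // the worker thread was never freed in time
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("cancel(true):  " + secondTaskRuns(true));
        System.out.println("cancel(false): " + secondTaskRuns(false));
    }
}
```

With cancel(false) the future is marked cancelled but the worker keeps blocking, so the queued task waits; with cancel(true) the worker is interrupted and goes back to the pool.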

The master branch code has changed a bit, as the lock is no longer there. I 
think the fix still applies to the master branch, and I will try to test it 
there.

> ResultBoundedCompletionService's cancel() needs to interrupt the working 
> thread and free it to the thread-pool
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-17889
>                 URL: https://issues.apache.org/jira/browse/HBASE-17889
>             Project: HBase
>          Issue Type: Bug
>          Components: Client
>    Affects Versions: 2.0.0, 1.4.0, 1.2.6, 1.3.2
>            Reporter: huaxiang sun
>            Assignee: huaxiang sun
>         Attachments: HBASE-17889-master-001.patch, jstack.txt
>
>
> We ran into a case with read replicas: when the server hosting the primary 
> region is shut down, Get did not go to the replica region and paused for 
> about 50 seconds before resuming. 
> More debugging showed that when the server is down, one of the threads is 
> stuck at the write while holding the lock at 
> https://github.com/apache/hbase/blob/branch-1.3/hbase-client/src/main/java/org/apache/hadoop/hbase/ipc/RpcClientImpl.java#L916.
> Later write threads wait on this lock until all threads in the connection's 
> thread pool are stuck on it. At that point, no work gets done. Once the 
> socket write times out, all the threads are freed and work continues.
> When QueueingFuture#cancel() is called, it neither interrupts the working 
> thread nor returns the thread to the pool.
> Attaching the jstack trace.


