Bryan Beaudreault created HBASE-27768:
-----------------------------------------

             Summary: Race conditions in BlockingRpcConnection
                 Key: HBASE-27768
                 URL: https://issues.apache.org/jira/browse/HBASE-27768
             Project: HBase
          Issue Type: Bug
            Reporter: Bryan Beaudreault


We've been experiencing strange timeouts since upgrading to the hbase2 client. 
We use BlockingRpcConnection for now, until we migrate our auth stack to native 
TLS. While diagnosing the timeouts, I noticed a few issues in this class:
 # Most importantly, there is a race condition that can leave a single 
BlockingRpcConnection instance with 2 reader threads running. Both threads 
then compete for the same socket, which causes unexpected timeouts and, in some 
cases, corrupted responses (i.e. InvalidProtocolBufferException).
 # The waitForWork loop does not properly handle interruption. If the thread is 
interrupted while the above race condition is in effect, the waitForWork loop 
ends up spinning in a tight loop forever: the "wait()" call immediately throws 
InterruptedException, we set the interrupted state back, and the loop restarts. 
No actual waiting occurs anymore.
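
To illustrate the second point, here is a minimal standalone sketch of the 
interruption behaviour. It is not the actual BlockingRpcConnection code: the 
class and method names (WaitLoopSketch, spinsWhenFlagRestored, 
iterationsWhenExitingOnInterrupt) are hypothetical, and the loop is bounded so 
that the example terminates. It only relies on the documented behaviour of 
Object.wait(), which throws InterruptedException right away if the calling 
thread's interrupted flag is already set.

```java
// Hypothetical sketch, not the real HBase code.
final class WaitLoopSketch {

    // Problematic pattern: catch InterruptedException, restore the flag,
    // and keep looping. Returns how many wait() calls threw immediately.
    static int spinsWhenFlagRestored(int maxIterations) {
        Object lock = new Object();
        int immediateThrows = 0;
        Thread.currentThread().interrupt(); // simulate an interruption
        synchronized (lock) {
            for (int i = 0; i < maxIterations; i++) {
                try {
                    lock.wait(1000); // would block if the flag were clear
                } catch (InterruptedException e) {
                    immediateThrows++;
                    // Restoring the flag makes the next wait() throw too.
                    Thread.currentThread().interrupt();
                }
            }
        }
        Thread.interrupted(); // clear the flag so the caller is unaffected
        return immediateThrows;
    }

    // One possible alternative: treat interruption as a signal to leave
    // the loop. Returns how many iterations ran before exiting.
    static int iterationsWhenExitingOnInterrupt(int maxIterations) {
        Object lock = new Object();
        int iterations = 0;
        Thread.currentThread().interrupt(); // simulate an interruption
        synchronized (lock) {
            for (int i = 0; i < maxIterations; i++) {
                iterations++;
                try {
                    lock.wait(1000);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // keep the flag
                    break;                              // but stop looping
                }
            }
        }
        Thread.interrupted(); // clear the flag so the caller is unaffected
        return iterations;
    }

    public static void main(String[] args) {
        System.out.println("spins with flag restored: "
            + spinsWhenFlagRestored(1000));
        System.out.println("iterations when exiting:  "
            + iterationsWhenExitingOnInterrupt(1000));
    }
}
```

In the first method every one of the bounded iterations throws without 
waiting, which is the tight loop described above. The second method is only 
one way such a loop could be made to terminate; whether exiting is the right 
behaviour for the real waitForWork loop is a separate question.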

The race condition is somewhat rare, only occurring in certain failure 
scenarios on our highest volume clients. But once it happens, the affected 
server connection keeps throwing a low level of errors until the client is 
bounced.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
