Bryan Beaudreault created HBASE-27768:
-----------------------------------------
Summary: Race conditions in BlockingRpcConnection
Key: HBASE-27768
URL: https://issues.apache.org/jira/browse/HBASE-27768
Project: HBase
Issue Type: Bug
Reporter: Bryan Beaudreault

We've been experiencing strange timeouts since upgrading to the HBase 2 client. We use BlockingRpcConnection for now, until we migrate our auth stack to native TLS. In diagnosing the timeouts, I noticed a few issues in this class:

# Most importantly, there is a race condition that can leave a single BlockingRpcConnection instance with 2 reader threads running. Both threads then compete for the socket, which causes unexplained timeouts and, in some cases, corrupted responses (e.g. InvalidProtocolBufferException).
# The waitForWork loop does not properly handle interruption. When it gets interrupted and the above race condition occurs, the waitForWork loop ends up in a tight loop forever: the "wait()" call instantly throws InterruptedException, we set the interrupted state back, and the loop restarts. So no waiting occurs anymore.

The race condition is somewhat rare, only occurring in certain failure scenarios on our highest-volume clients. But when it happens, a low level of errors keeps being thrown for the affected server connection until the client is bounced.
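
To illustrate the interruption problem in the waitForWork loop, here is a minimal, self-contained sketch. It is not the BlockingRpcConnection source: the class WaitLoopSketch, the method iterationsAfterInterrupt, the 10 ms wait and the time budget are all hypothetical, chosen only to show the pattern. The thread interrupts itself to simulate an interrupt arriving before the loop. Exiting on interruption is shown as one possible way to handle it, not necessarily the fix that belongs in HBase.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a reader thread's wait loop; not the HBase code. */
final class WaitLoopSketch {

  /**
   * Runs a wait loop on a fresh thread whose interrupt flag is already set and
   * returns how many times the loop body ran within roughly budgetMs.
   */
  static int iterationsAfterInterrupt(boolean exitOnInterrupt, long budgetMs) {
    final Object lock = new Object();
    final AtomicInteger iterations = new AtomicInteger();

    Thread reader = new Thread(() -> {
      // Simulate an interrupt that arrives before the loop starts.
      Thread.currentThread().interrupt();
      long deadline = System.nanoTime() + budgetMs * 1_000_000L;
      synchronized (lock) {
        while (System.nanoTime() < deadline) {
          iterations.incrementAndGet();
          try {
            // Throws immediately if the interrupt flag is set, clearing the flag.
            lock.wait(10);
          } catch (InterruptedException e) {
            // Restore the flag so callers can still observe the interrupt.
            Thread.currentThread().interrupt();
            if (exitOnInterrupt) {
              return; // treat the interrupt as a request to stop
            }
            // Problem pattern: loop again with the flag still set, so the
            // next wait() throws at once and the loop never blocks.
          }
        }
      }
    });

    reader.start();
    try {
      reader.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting for the sketch thread", e);
    }
    return iterations.get();
  }

  public static void main(String[] args) {
    System.out.println("restore flag and keep looping: " + iterationsAfterInterrupt(false, 50));
    System.out.println("restore flag and exit:         " + iterationsAfterInterrupt(true, 50));
  }
}
```

With a 10 ms wait and a 50 ms budget, a loop that actually blocks would run only a handful of times. The first variant runs far more often because every wait() call throws immediately, while the second variant stops after a single iteration.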