[ https://issues.apache.org/jira/browse/HBASE-27768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Bryan Beaudreault reassigned HBASE-27768:
-----------------------------------------

    Assignee: Bryan Beaudreault

> Race conditions in BlockingRpcConnection
> ----------------------------------------
>
>                 Key: HBASE-27768
>                 URL: https://issues.apache.org/jira/browse/HBASE-27768
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Bryan Beaudreault
>            Assignee: Bryan Beaudreault
>            Priority: Major
>
> We've been experiencing strange timeouts since upgrading to the hbase2 client. We
> use BlockingRpcConnection for now, until we migrate our auth stack to native
> TLS. While diagnosing the timeouts, I noticed a few issues in this class:
> # Most importantly, there is a race condition that can leave a single
> BlockingRpcConnection instance with two reader threads running. In that case,
> both threads compete for the socket, which causes strange timeouts and, in
> some cases, corrupted responses (e.g. InvalidProtocolBufferException).
> # The waitForWork loop does not handle interruption properly. When the thread
> is interrupted and the race condition above has occurred, the waitForWork loop
> ends up spinning forever in a tight loop: the wait() call throws
> InterruptedException immediately, we restore the interrupted state, and the
> loop restarts. So no actual waiting happens anymore.
> The race condition is fairly rare, occurring only in certain failure
> scenarios on our highest-volume clients. But when it does happen, a low level
> of errors keeps being thrown for the affected server connection until the
> client is bounced.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
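The first issue, two reader threads on one connection, has the shape of a check-then-act race: two callers both conclude that no reader is running and each start one. The sketch below is hypothetical and is neither the HBase implementation nor the eventual fix; `SingleReaderConnection`, `ensureReaderStarted`, and `readersStarted` are illustrative names. It assumes a connection owns one `Thread` that drains the socket, and shows one way to make "is a reader alive? if not, start one" atomic by doing both steps under the same lock.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Hypothetical sketch (not HBase code): a connection that must own at most one
 * reader thread for its socket. All names are illustrative assumptions.
 */
public class SingleReaderConnection {
    // Guarded by "this": the single reader thread currently draining the socket.
    private Thread reader;
    // Counts how many reader threads were ever started, for demonstration only.
    private final AtomicInteger readersStarted = new AtomicInteger();

    /**
     * Starts a reader only if none is alive. The alive check and start() run
     * under the same lock, so concurrent callers cannot both start a reader.
     */
    public synchronized void ensureReaderStarted(Runnable readLoop) {
        if (reader != null && reader.isAlive()) {
            return; // a reader already owns the socket
        }
        reader = new Thread(readLoop, "rpc-reader");
        reader.setDaemon(true);
        readersStarted.incrementAndGet();
        reader.start();
    }

    public int readersStarted() {
        return readersStarted.get();
    }

    public static void main(String[] args) throws InterruptedException {
        SingleReaderConnection conn = new SingleReaderConnection();
        CountDownLatch release = new CountDownLatch(1);
        // Stand-in read loop that blocks until released, like a reader waiting on the socket.
        Runnable readLoop = () -> {
            try {
                release.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        // Several callers race to make sure a reader is running.
        Thread[] callers = new Thread[8];
        for (int i = 0; i < callers.length; i++) {
            callers[i] = new Thread(() -> conn.ensureReaderStarted(readLoop));
            callers[i].start();
        }
        for (Thread t : callers) {
            t.join();
        }
        System.out.println("readers started: " + conn.readersStarted()); // prints "readers started: 1"
        release.countDown(); // let the single reader finish
    }
}
```

If `synchronized` were removed from `ensureReaderStarted`, two callers could both observe no live reader and each start one, which is the kind of interleaving that would produce the two-reader symptom described in the issue.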
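The second issue follows from how `Object.wait()` behaves: it throws `InterruptedException` right away if the calling thread's interrupt flag is already set, and clears the flag when it does. A loop that catches the exception, sets the flag again, and retries therefore never blocks again and spins. The sketch below is a hypothetical illustration, not HBase's code or its fix: `InterruptAwareWaiter`, `addCall`, `nextCall`, and `close` are illustrative names, and `waitForWork` only borrows the method name from the issue. It assumes a reader waiting on a monitor for pending calls, and treats interruption as a signal to exit, which is one possible policy.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/**
 * Hypothetical sketch (not HBase code) of a reader's wait loop that treats
 * interruption as a signal to stop instead of retrying.
 */
public class InterruptAwareWaiter {
    private final Queue<Integer> pendingCalls = new ArrayDeque<>(); // guarded by "this"
    private boolean closed; // guarded by "this"

    public synchronized void addCall(int callId) {
        pendingCalls.add(callId);
        notifyAll();
    }

    public synchronized Integer nextCall() {
        return pendingCalls.poll();
    }

    public synchronized void close() {
        closed = true;
        notifyAll();
    }

    /**
     * Returns true when there is work to read, false when the reader should exit.
     * A loop that restored the flag and kept waiting would spin, because every
     * later wait() throws immediately. Here the flag is restored and we return.
     */
    public synchronized boolean waitForWork() {
        while (!closed && pendingCalls.isEmpty()) {
            try {
                wait();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false; // stop waiting rather than spinning
            }
        }
        return !closed;
    }

    public static void main(String[] args) throws InterruptedException {
        InterruptAwareWaiter waiter = new InterruptAwareWaiter();
        Thread reader = new Thread(() -> {
            while (waiter.waitForWork()) {
                waiter.nextCall(); // stand-in for reading one response off the socket
            }
        }, "rpc-reader");
        reader.start();
        waiter.addCall(1);
        reader.interrupt();
        reader.join(1000);
        System.out.println("reader exited: " + !reader.isAlive());
    }
}
```

Returning `false` lets the caller decide what interruption means (for example, closing the connection) while the restored flag keeps the interrupt visible to code further up the stack.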