Hello all,

I am looking for a reviewer for a fix to the long-standing bug HBASE-27781, which affects the sync client in 2.5.3+, 2.6.x, and branch-2.

The bug is an edge case in the client's handling of an operation timeout. There is a scenario where meta slowness can lead the sync client to throw an unchecked AssertionError that bubbles up to the user application layer, where proper handling would instead result in a RetriesExhaustedWithDetailsException. This can be catastrophic because the user application is not expecting, and should not be catching, this unchecked exception: the application could crash when it is encountered, or application threads could silently die.

The AssertionError in question is thrown explicitly in the client; it is not an "assert <condition>" statement that can be enabled or disabled with the '-ea' JVM flag.

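To illustrate the point about the explicit throw outside of HBase, here is a minimal Java sketch. It is not the client code from the JIRA: the class and method names are hypothetical, and the caller catching only IOException is an assumption about how a typical application handles client failures. It shows that an explicitly thrown AssertionError passes straight through such a handler, whether or not '-ea' is set.

```java
import java.io.IOException;

// Hypothetical demo class; none of these names come from HBase.
class ExplicitAssertionErrorDemo {

    // Stand-in for a client call that declares a checked exception but
    // explicitly throws an AssertionError. Unlike "assert <condition>",
    // this throw does not depend on the -ea JVM flag.
    static void explicitThrow() throws IOException {
        throw new AssertionError("thrown explicitly, independent of -ea");
    }

    // Assumed application-side handling: only the checked exception is caught.
    static String callAndHandle() {
        try {
            explicitThrow();
            return "ok";
        } catch (IOException e) {
            return "handled IOException";
        }
    }

    // Reports the simple class name of whatever escapes callAndHandle(),
    // or "none" if nothing does.
    static String whatEscapes() {
        try {
            callAndHandle();
            return "none";
        } catch (Throwable t) {
            return t.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(whatEscapes()); // prints "AssertionError"
    }
}
```

In a real application there is usually no outer catch of Throwable, so the error would terminate the calling thread instead of being reported.
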
One of the triggering scenarios for the bug is meta slowness, which, while not very common, is not exceedingly rare. A lot of work has gone into the sync client in the past to better handle meta slowness and operation timeouts. This bug is also a blocker for HBASE-28730, which attempts to complete the earlier work needed for the sync client to respect operation timeouts.

I have taken care to provide a lot of detail about the bug and the fix in the JIRA. The functional scope of the changes is limited to the timeout/location error handling in the sync client's groupAndSendMulti; the happy path, where there are no sync client location errors, is completely unaffected by the patch. I have added a test case using MiniCluster that reproduces the meta slowness that triggers the bug. Without the fix, the test errors out with an AssertionError.

I would greatly appreciate a review of the bug fix so that we can work towards resolving this long-standing bug and unblock HBASE-28730.

JIRA: https://issues.apache.org/jira/browse/HBASE-27781
PR: https://github.com/apache/hbase/pull/7079

Thank you,
Daniel Roudnitsky