Bryan Beaudreault created HBASE-28358:
-----------------------------------------

             Summary: AsyncProcess inconsistent exception thrown for operation timeout
                 Key: HBASE-28358
                 URL: https://issues.apache.org/jira/browse/HBASE-28358
             Project: HBase
          Issue Type: Bug
            Reporter: Bryan Beaudreault


I'm not sure if I'll get to this, but wanted to log it as a known issue.

AsyncProcess breaks a batch into sub-batches by regionserver, then submits one 
callable per regionserver to a threadpool. The main thread then calls 
waitUntilDone() with the operation timeout. If the callables don't finish within 
that timeout, a SocketTimeoutException is thrown. This exception is not very 
useful because it gives no sense of how many calls were in progress, on which 
servers, or why they were delayed.
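
To make the shape of the problem concrete, here is a rough sketch of the fan-out 
and wait described above. It is not the actual AsyncProcess code: FanOutSketch, 
Action, ServerCall and submitAndWait are hypothetical names, and a CountDownLatch 
stands in for whatever waitUntilDone() does internally.

```java
import java.net.SocketTimeoutException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the fan-out described above; not HBase's AsyncProcess.
class FanOutSketch {
  /** An action paired with the server assumed to host its region. */
  static final class Action {
    final String server;
    final String row;
    Action(String server, String row) { this.server = server; this.row = row; }
  }

  /** Hypothetical per-server work, e.g. one multi RPC. */
  interface ServerCall {
    void run(String server, List<Action> actions) throws Exception;
  }

  static void submitAndWait(List<Action> batch, ServerCall call, long operationTimeoutMs)
      throws SocketTimeoutException, InterruptedException {
    // 1. Break the batch into sub-batches keyed by server.
    Map<String, List<Action>> byServer = new TreeMap<>();
    for (Action a : batch) {
      byServer.computeIfAbsent(a.server, s -> new ArrayList<>()).add(a);
    }
    if (byServer.isEmpty()) {
      return; // nothing to submit
    }
    // 2. Submit one task per server to a pool.
    ExecutorService pool = Executors.newFixedThreadPool(byServer.size());
    CountDownLatch done = new CountDownLatch(byServer.size());
    try {
      for (Map.Entry<String, List<Action>> e : byServer.entrySet()) {
        pool.submit(() -> {
          try {
            call.run(e.getKey(), e.getValue());
          } catch (Exception ignored) {
            // Per-action error tracking is omitted in this sketch.
          } finally {
            done.countDown();
          }
        });
      }
      // 3. Wait in the calling thread. On timeout the only information left is
      //    the timeout itself: nothing about servers or in-flight calls.
      if (!done.await(operationTimeoutMs, TimeUnit.MILLISECONDS)) {
        throw new SocketTimeoutException(
            "not finished within operationTimeout=" + operationTimeoutMs + " ms");
      }
    } finally {
      pool.shutdownNow();
    }
  }
}
```

The exception in step 3 is thrown by the waiter, which knows nothing about what 
each task was doing, which is why the message cannot be very informative.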

Recently we've been improving the adherence to operation timeout within the 
callables themselves. The main driver here has been to ensure we don't 
erroneously clear the meta cache for operation timeout related errors. So we've 
added a new OperationTimeoutExceededException, which is thrown from within the 
callables and does not cause a meta cache clear. The added benefit is that if 
these bubble up to the caller, they are wrapped in 
RetriesExhaustedWithDetailsException which includes a lot more info about which 
server and which action is affected. 
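
As an illustration of the approach in the paragraph above, the sketch below 
checks the deadline inside the per-server call and wraps the failure with the 
server and row. All names here (DeadlineSketch, OperationTimeoutExceeded, 
FailedWithDetails, Attempt, shouldClearMetaCache) are hypothetical analogues 
invented for the example, not the real HBase classes or their signatures.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of deadline enforcement inside a callable; not HBase code.
class DeadlineSketch {
  /** Hypothetical analogue of OperationTimeoutExceededException. */
  static class OperationTimeoutExceeded extends Exception {
    OperationTimeoutExceeded(String msg) { super(msg); }
  }

  /** Hypothetical analogue of a "with details" wrapper: records server and row. */
  static class FailedWithDetails extends Exception {
    final String server;
    final String row;
    FailedWithDetails(String server, String row, Throwable cause) {
      super("row " + row + " on " + server + ": " + cause.getMessage(), cause);
      this.server = server;
      this.row = row;
    }
  }

  /** One attempt at the call; returns true on success. */
  interface Attempt {
    boolean succeeded() throws Exception;
  }

  /** A timeout says nothing about region locations, so it should not clear the cache. */
  static boolean shouldClearMetaCache(Throwable t) {
    return !(t instanceof OperationTimeoutExceeded);
  }

  /** Retries until success or until the operation deadline passes. Returns tries used. */
  static int callWithRetries(String server, String row, long operationTimeoutMs,
      Attempt attempt) throws FailedWithDetails {
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(operationTimeoutMs);
    int tries = 0;
    try {
      while (true) {
        // Check the deadline before every attempt, inside the callable itself.
        if (System.nanoTime() - deadline >= 0) {
          throw new OperationTimeoutExceeded("operation timeout of "
              + operationTimeoutMs + " ms exceeded after " + tries + " tries");
        }
        tries++;
        if (attempt.succeeded()) {
          return tries;
        }
      }
    } catch (Exception e) {
      // The wrapper carries which server and which action were affected.
      throw new FailedWithDetails(server, row, e);
    }
  }
}
```

Because the timeout is detected where the server and action are known, the 
resulting error can name both, unlike a timeout raised by the waiting thread.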

We've now covered most, but not all, of the cases where the operation timeout is 
exceeded. So when the operation timeout is exceeded, the caller sometimes sees a 
SocketTimeoutException from waitUntilDone and sometimes an 
OperationTimeoutExceededException from the callables, depending on which one 
fails first. It would be nice to finish the job here and ensure that we always 
throw OperationTimeoutExceededException from the callables.

The main remaining case is the call to locateRegion, which hits meta and 
does not honor the call's operation timeout (it uses the meta operation timeout 
instead). Resolving this would require some refactoring of 
ConnectionImplementation.locateRegion to allow passing in an operation timeout 
and having it apply to both the userRegionLock and the meta scan.
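
One possible shape for that refactoring, sketched under assumptions: the caller 
passes in its remaining operation time, and that single deadline bounds both the 
lock wait and the meta scan. LocateSketch, MetaScan and the locateRegion 
signature below are hypothetical and do not reflect the real 
ConnectionImplementation; only ReentrantLock.tryLock is a real API.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch of a deadline-aware locate; not ConnectionImplementation.
class LocateSketch {
  /** Hypothetical analogue of OperationTimeoutExceededException. */
  static class OperationTimeoutExceeded extends Exception {
    OperationTimeoutExceeded(String msg) { super(msg); }
  }

  /** Hypothetical meta lookup that accepts its own timeout in milliseconds. */
  interface MetaScan {
    String scan(String row, long timeoutMs) throws Exception;
  }

  // Stand-in for the lock that serializes meta lookups.
  final ReentrantLock userRegionLock = new ReentrantLock();

  String locateRegion(String row, long remainingOperationMs, MetaScan meta) throws Exception {
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(remainingOperationMs);
    // Bound the lock wait by the caller's remaining operation time.
    if (!userRegionLock.tryLock(remainingOperationMs, TimeUnit.MILLISECONDS)) {
      throw new OperationTimeoutExceeded("timed out waiting for userRegionLock, row=" + row);
    }
    try {
      // Whatever time is left after acquiring the lock bounds the meta scan.
      long leftMs = TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
      if (leftMs <= 0) {
        throw new OperationTimeoutExceeded("no time left for meta scan, row=" + row);
      }
      return meta.scan(row, leftMs);
    } finally {
      userRegionLock.unlock();
    }
  }
}
```

With something like this, a slow locate would surface as the same exception type 
as the other timeout paths instead of falling through to the waiter's timeout.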


