[ 
https://issues.apache.org/jira/browse/HBASE-27781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17950027#comment-17950027
 ] 

Daniel Roudnitsky edited comment on HBASE-27781 at 5/7/25 3:03 PM:
-------------------------------------------------------------------

[~ndimiduk] to clarify, my understanding is that in the context of 
groupAndSendMulti, where this bug is, "already finished" means "failed to 
completion because the region location could not be resolved". The actions 
passed to groupAndSendMulti are still incomplete when they arrive there for 
region location resolution and grouping by server. The only time we "complete" 
an action in groupAndSendMulti is when we fail it because its region location 
could not be resolved. (There is also the case of a replica action which could 
concurrently succeed, but that's handled properly from what I have observed.) 
So the actions we fail due to failed location resolution are the only ones we 
have to take care not to "double fail", which is what triggers this assertion 
error. In my patch we don't fail all actions in the entire batch operation. We 
only fail the actions in the sub-batch that is being processed in 
groupAndSendMulti at the time the operation timeout is exceeded. Any actions 
that already completed successfully are not failed, and their responses are 
preserved.
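
To make the "double fail" guard concrete, here is a minimal sketch of the idea, not the actual patch. The class SubBatchFailureSketch and its methods complete and failSubBatchOnTimeout are hypothetical stand-ins for the bookkeeping in AsyncRequestFutureImpl; the sketch only assumes that each action has a result slot and that a shared counter tracks the actions still in progress.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for the bookkeeping discussed above; not HBase code.
class SubBatchFailureSketch {
  // results[i] == null means action i is still in progress.
  private final Object[] results;
  private final AtomicLong actionsInProgress;

  SubBatchFailureSketch(int totalActions) {
    this.results = new Object[totalActions];
    this.actionsInProgress = new AtomicLong(totalActions);
  }

  // Records a result only if the action has none yet.
  // Returns true if this call is the one that completed the action.
  synchronized boolean complete(int index, Object result) {
    if (results[index] != null) {
      return false; // already finished: never decrement a second time
    }
    results[index] = result;
    long remaining = actionsInProgress.decrementAndGet();
    if (remaining < 0) {
      throw new AssertionError("actionsInProgress went negative: " + remaining);
    }
    return true;
  }

  // On operation timeout, fails only the unfinished actions of the sub-batch
  // currently being processed. Returns how many actions were failed here.
  int failSubBatchOnTimeout(List<Integer> subBatch, Exception timeout) {
    int failed = 0;
    for (int index : subBatch) {
      if (complete(index, timeout)) {
        failed++;
      }
    }
    return failed;
  }

  long inProgress() {
    return actionsInProgress.get();
  }

  synchronized Object resultAt(int index) {
    return results[index];
  }
}
```

In this sketch an action that already failed location resolution keeps its original error, and an action outside the sub-batch that already succeeded is never touched, so the counter reaches zero without going negative.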


> AssertionError in AsyncRequestFutureImpl when timing out during location 
> resolution
> -----------------------------------------------------------------------------------
>
>                 Key: HBASE-27781
>                 URL: https://issues.apache.org/jira/browse/HBASE-27781
>             Project: HBase
>          Issue Type: Bug
>          Components: asyncclient
>            Reporter: Bryan Beaudreault
>            Assignee: Daniel Roudnitsky
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 2.6.3
>
>
> In AsyncRequestFutureImpl we fail fast when the operation timeout is exceeded 
> during location resolution 
> [here|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L460-L462].
>  In that handling, we loop over all actions and set them as failed. The 
> problem is that some number of actions may have already finished when we get 
> to this spot. So actionsInProgress would already have been decremented for 
> those, and now we're going to decrement it once for every action. This causes 
> an assertion error since the counter goes negative 
> [here|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L1197],
>  causing the HBase client to throw an unchecked exception which can kill the 
> caller thread that invoked the operation which should have timed out, as 
> callers of the client should not be catching {{Error}} and its subclasses 
> like {{AssertionError}}. 
> We still want to fail all actions, because none will be executed. But we need 
> special handling to avoid this case. Maybe don't bother decrementing 
> actionsInProgress at all, and instead set it to 0.
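
The arithmetic behind the description can be illustrated with a small sketch. The class ActionCounterSketch and its two helpers are hypothetical and only model the counter; they are not taken from AsyncRequestFutureImpl.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical model of the actionsInProgress counter; not HBase code.
class ActionCounterSketch {
  // Models the problematic path: decrement once per action in the batch,
  // regardless of how many actions were already finished and counted.
  static long decrementForAll(AtomicLong actionsInProgress, int totalActions) {
    return actionsInProgress.addAndGet(-totalActions);
  }

  // Models the alternative floated in the description: skip the per-action
  // decrements and set the counter straight to zero.
  static long resetToZero(AtomicLong actionsInProgress) {
    actionsInProgress.set(0);
    return actionsInProgress.get();
  }
}
```

With three actions of which one has already finished, the counter stands at 2, so decrementing by all three leaves it at -1, which is the condition the assertion rejects.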



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
