[jira] [Commented] (HBASE-27781) AssertionError in AsyncRequestFutureImpl when timing out during location resolution
[
https://issues.apache.org/jira/browse/HBASE-27781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18004848#comment-18004848
]
Hudson commented on HBASE-27781:
Results for branch branch-2
[build #1295 on
builds.a.o|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2/1295/]:
(x) *{color:red}-1 overall{color}*
details (if available):
(/) {color:green}+1 general checks{color}
-- For more information [see general
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2/1295/General_20Nightly_20Build_20Report/]
(/) {color:green}+1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2)
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2/1295/JDK8_20Nightly_20Build_20Report_20_28Hadoop2_29/]
(x) {color:red}-1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3)
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2/1295/JDK8_20Nightly_20Build_20Report_20_28Hadoop3_29/]
(/) {color:green}+1 jdk11 hadoop3 checks{color}
-- For more information [see jdk11
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2/1295/JDK11_20Nightly_20Build_20Report_20_28Hadoop3_29/]
(/) {color:green}+1 jdk17 hadoop3 checks{color}
-- For more information [see jdk17
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2/1295/JDK17_20Nightly_20Build_20Report_20_28Hadoop3_29/]
(/) {color:green}+1 jdk17 hadoop 3.3.5 backward compatibility checks{color}
-- For more information [see jdk17
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2/1295/JDK17_20Nightly_20Build_20Report_20_28Hadoop3_29/]
(/) {color:green}+1 jdk17 hadoop 3.3.6 backward compatibility checks{color}
-- For more information [see jdk17
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2/1295/JDK17_20Nightly_20Build_20Report_20_28Hadoop3_29/]
(/) {color:green}+1 jdk17 hadoop 3.4.0 backward compatibility checks{color}
-- For more information [see jdk17
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2/1295/JDK17_20Nightly_20Build_20Report_20_28Hadoop3_29/]
(/) {color:green}+1 source release artifact{color}
-- See build output for details.
(/) {color:green}+1 client integration test for HBase 2 {color}
(/) {color:green}+1 client integration test for 3.3.5 {color}
(/) {color:green}+1 client integration test for 3.3.6 {color}
(/) {color:green}+1 client integration test for 3.4.0 {color}
(/) {color:green}+1 client integration test for 3.4.1 {color}
> AssertionError in AsyncRequestFutureImpl when timing out during location
> resolution
> ---
>
> Key: HBASE-27781
> URL: https://issues.apache.org/jira/browse/HBASE-27781
> Project: HBase
> Issue Type: Bug
> Components: Client
> Affects Versions: 2.6.0, 2.5.3
> Reporter: Bryan Beaudreault
> Assignee: Daniel Roudnitsky
> Priority: Major
> Labels: pull-request-available
> Fix For: 2.7.0, 2.6.3, 2.5.12
>
>
> +Background+
> In AsyncRequestFutureImpl we fail fast when the operation timeout is exceeded
> during location resolution
> [here|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L460-L462].
> In that handling, we loop over all actions still being processed in
> groupAndSendMulti at the time the operation timeout is exceeded and set
> them as failed. The problem is that some of these actions may have already
> failed to completion when we get to this spot: if we fail to resolve the region
> location for an action, we fail it to completion in
> [findAllLocationsOrFail|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L466]
> (fail to completion == set the error for the action, decrement the actions in
> progress counter, and do not retry the action again). We should not
> "double fail" any actions that were already failed due to failed location
> resolution, because we would decrement the actions in progress counter twice
> for the same action and throw off the (atomic) action counter accounting the
> sync client relies on to [tell when the batch operation is
> complete|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L1267-L1268].
> +Problem+
> In the for loop
> [here|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L460-L462]
> we fail all actions (and decrement action in progress counter for all
> actions) in the groupAndSendMulti - which includes the aforementioned actions
> that were alrea
[jira] [Commented] (HBASE-27781) AssertionError in AsyncRequestFutureImpl when timing out during location resolution
[
https://issues.apache.org/jira/browse/HBASE-27781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18004361#comment-18004361
]
Hudson commented on HBASE-27781:
Results for branch branch-2.6
[build #336 on
builds.a.o|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.6/336/]:
(x) *{color:red}-1 overall{color}*
details (if available):
(/) {color:green}+1 general checks{color}
-- For more information [see general
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.6/336/General_20Nightly_20Build_20Report/]
(x) {color:red}-1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2)
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.6/336/JDK8_20Nightly_20Build_20Report_20_28Hadoop2_29/]
(x) {color:red}-1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3)
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.6/336/JDK8_20Nightly_20Build_20Report_20_28Hadoop3_29/]
(/) {color:green}+1 jdk11 hadoop3 checks{color}
-- For more information [see jdk11
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.6/336/JDK11_20Nightly_20Build_20Report_20_28Hadoop3_29/]
(/) {color:green}+1 jdk17 hadoop3 checks{color}
-- For more information [see jdk17
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.6/336/JDK17_20Nightly_20Build_20Report_20_28Hadoop3_29/]
(/) {color:green}+1 jdk17 hadoop 3.3.5 backward compatibility checks{color}
-- For more information [see jdk17
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.6/336/JDK17_20Nightly_20Build_20Report_20_28Hadoop3_29/]
(x) {color:red}-1 jdk17 hadoop 3.3.6 backward compatibility checks{color}
-- For more information [see jdk17
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.6/336/JDK17_20Nightly_20Build_20Report_20_28Hadoop3_29/]
(/) {color:green}+1 jdk17 hadoop 3.4.0 backward compatibility checks{color}
-- For more information [see jdk17
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.6/336/JDK17_20Nightly_20Build_20Report_20_28Hadoop3_29/]
(/) {color:green}+1 source release artifact{color}
-- See build output for details.
(/) {color:green}+1 client integration test for HBase 2 {color}
(/) {color:green}+1 client integration test for 3.3.5 {color}
(/) {color:green}+1 client integration test for 3.3.6 {color}
(/) {color:green}+1 client integration test for 3.4.0 {color}
(/) {color:green}+1 client integration test for 3.4.1 {color}
[jira] [Commented] (HBASE-27781) AssertionError in AsyncRequestFutureImpl when timing out during location resolution
[
https://issues.apache.org/jira/browse/HBASE-27781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18004130#comment-18004130
]
Daniel Roudnitsky commented on HBASE-27781:
---
Thank you very much [~charlesconnell] for taking the time to review and merge!
[jira] [Commented] (HBASE-27781) AssertionError in AsyncRequestFutureImpl when timing out during location resolution
[
https://issues.apache.org/jira/browse/HBASE-27781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18004122#comment-18004122
]
Charles Connell commented on HBASE-27781:
-
Merged PR into branch-2. Cherry-picked into branch-2.5 and branch-2.6.
> AssertionError in AsyncRequestFutureImpl when timing out during location
> resolution
> ---
>
> Key: HBASE-27781
> URL: https://issues.apache.org/jira/browse/HBASE-27781
> Project: HBase
> Issue Type: Bug
> Components: Client
> Affects Versions: 2.6.0, 2.5.3
> Reporter: Bryan Beaudreault
> Assignee: Daniel Roudnitsky
> Priority: Major
> Labels: pull-request-available
> Fix For: 2.7.0, 2.5.12, 2.6.4
>
>
> +Background+
> In AsyncRequestFutureImpl we fail fast when the operation timeout is exceeded
> during location resolution
> [here|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L460-L462].
> In that handling, we loop over all actions still being processed in
> groupAndSendMulti at the time the operation timeout is exceeded and set
> them as failed. The problem is that some of these actions may have already
> failed to completion when we get to this spot: if we fail to resolve the region
> location for an action, we fail it to completion in
> [findAllLocationsOrFail|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L466]
> (fail to completion == set the error for the action, decrement the actions in
> progress counter, and do not retry the action again). We should not
> "double fail" any actions that were already failed due to failed location
> resolution, because we would decrement the actions in progress counter twice
> for the same action and throw off the (atomic) action counter accounting the
> sync client relies on to [tell when the batch operation is
> complete|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L1267-L1268].
> +Problem+
> In the for loop
> [here|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L460-L462]
> we fail all actions (and decrement the actions in progress counter for all
> actions) in groupAndSendMulti, including the aforementioned actions
> that were already failed through
> [findAllLocationsOrFail|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L466].
> If there was a location failure, we therefore decrement the actions in
> progress counter more times than there are actions. This causes an
> assertion error in the actions in progress counter, since we go negative
> [here|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L1197]
> and should never have a negative number of actions in progress. The
> HBase client then throws an unchecked exception that is not handled within the
> client and bubbles up to the user application layer invoking the client.
> This may kill the caller thread/application that invoked the operation, which
> should have timed out with a RetriesExhaustedWithDetails exception (rather
> than throwing an unchecked AssertionError), as the user application layer
> should not be catching {{Error}} and its subclasses like
> {{AssertionError}}.
> +Triggering scenario/reproduction+
> The most common scenario where one could hit this bug is meta
> slowness when running batch operations. Suppose we have a batch with 3
> actions, and on trying to resolve the location for the first action, we
> time out repeatedly against the meta table due to meta slowness and consume the
> entire operation timeout on the meta scan attempts to resolve the location of
> the first action. In this case, we will fail the first action through
> [findAllLocationsOrFail|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L466],
> which brings the actionsInProgress counter to 2, and then we will [loop over
> all three
> actions|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L460-L462]
> and fail each of them. On the third action failure attempt the actions in
> progress counter is zero, so we attempt to decrement it to -1 and hit the
> assertion error. This is what the test case in the PR successfully
> reproduces.
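
The counter arithmetic in the triggering scenario above can be replayed with a small stand-alone model. This is only a sketch under stated assumptions: the class DoubleFailSketch, its methods, and the boolean "already failed" flag are hypothetical names made up for illustration, the model is single-threaded, and the guard shown is one possible way to avoid the double decrement, not necessarily what the merged PR does.

```java
import java.util.concurrent.atomic.AtomicLong;

/**
 * Simplified, hypothetical model of the actions-in-progress accounting
 * described in the issue. Names are illustrative, not HBase source.
 * Single-threaded on purpose; the real client is concurrent.
 */
public class DoubleFailSketch {
  private final AtomicLong actionsInProgress;
  // failed[i] is true once action i has been failed to completion.
  private final boolean[] failed;

  public DoubleFailSketch(int numActions) {
    this.actionsInProgress = new AtomicLong(numActions);
    this.failed = new boolean[numActions];
  }

  /** Fails an action unconditionally, as the timeout loop in the issue does. */
  public void failUnguarded(int index) {
    failed[index] = true;
    long remaining = actionsInProgress.decrementAndGet();
    if (remaining < 0) {
      throw new AssertionError("actions in progress went negative: " + remaining);
    }
  }

  /** Fails an action only if it has not already been failed to completion. */
  public void failGuarded(int index) {
    if (!failed[index]) {
      failUnguarded(index);
    }
  }

  public long remaining() {
    return actionsInProgress.get();
  }

  /** Replays the 3-action scenario with the unguarded loop; true if the counter check trips. */
  public static boolean unguardedLoopTrips() {
    DoubleFailSketch batch = new DoubleFailSketch(3);
    batch.failUnguarded(0); // location resolution fails action 0; counter is now 2
    try {
      for (int i = 0; i < 3; i++) {
        batch.failUnguarded(i); // third iteration attempts 0 -> -1
      }
      return false;
    } catch (AssertionError e) {
      return true;
    }
  }

  /** Replays the same scenario with the guarded loop; returns the final counter value. */
  public static long guardedLoopRemaining() {
    DoubleFailSketch batch = new DoubleFailSketch(3);
    batch.failUnguarded(0); // same location failure as above
    for (int i = 0; i < 3; i++) {
      batch.failGuarded(i); // skips action 0, fails actions 1 and 2
    }
    return batch.remaining();
  }

  public static void main(String[] args) {
    System.out.println("unguarded loop trips check: " + unguardedLoopTrips());
    System.out.println("guarded loop remaining: " + guardedLoopRemaining());
  }
}
```

In the unguarded replay the counter goes 3, 2, 1, 0 and the final decrement trips the check, matching the sequence described above. In the guarded replay the already-failed action is skipped and the counter ends at exactly zero.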
[jira] [Commented] (HBASE-27781) AssertionError in AsyncRequestFutureImpl when timing out during location resolution
[
https://issues.apache.org/jira/browse/HBASE-27781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17956796#comment-17956796
]
Daniel Roudnitsky commented on HBASE-27781:
---
I have updated the Jira to include more background on how
AsyncRequestFutureImpl works and more detail about the nature of the bug and
the test case/reproduction.
[jira] [Commented] (HBASE-27781) AssertionError in AsyncRequestFutureImpl when timing out during location resolution
[
https://issues.apache.org/jira/browse/HBASE-27781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17954779#comment-17954779
]
Daniel Roudnitsky commented on HBASE-27781:
---
Hey [~ndimiduk], do you have any advice on finding a reviewer for this bug fix?
It's been a bit of a challenge; I have bugged the devlist on this one once in
the past without luck.
> AssertionError in AsyncRequestFutureImpl when timing out during location
> resolution
> ---
>
> Key: HBASE-27781
> URL: https://issues.apache.org/jira/browse/HBASE-27781
> Project: HBase
> Issue Type: Bug
> Components: asyncclient
> Reporter: Bryan Beaudreault
> Assignee: Daniel Roudnitsky
> Priority: Major
> Labels: pull-request-available
> Fix For: 2.6.3
>
>
> In AsyncRequestFutureImpl we fail fast when the operation timeout is exceeded
> during location resolution
> [here|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L460-L462].
> In that handling, we loop over all actions and set them as failed. The problem
> is, some number of actions may have already finished when we get to this spot. So
> the actionsInProgress would have been decremented for those already, and now
> we're going to decrement by all actions. This causes an assertion error since
> we go negative
> [here|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L1197],
> causing the HBase client to throw an unchecked exception which can kill the
> caller thread that invoked the operation which should have timed out, as
> callers of the client should not be catching {{Error}} and its subclasses
> like {{AssertionError}}.
> We still want to fail all actions, because none will be executed. But we need
> special handling to avoid this case. Maybe don't bother decrementing the
> actionsInProgress at all, instead set to 0.
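
The point about callers not catching {{Error}} can be illustrated with plain Java, independent of HBase. This is a minimal sketch with hypothetical stand-in methods (no real client calls): a handler written for the expected timeout exception does not intercept an AssertionError, so the error leaves the caller's try block.

```java
import java.io.IOException;

/**
 * Hypothetical stand-ins showing why an AssertionError escapes a caller
 * that only handles exceptions. No HBase classes are used here.
 */
public class ErrorEscapesSketch {
  /** Stand-in for a batch call that hits the failed assertion. */
  public static void batchThatTripsAssertion() {
    throw new AssertionError("actions in progress went negative");
  }

  /** Stand-in for a batch call that times out with a checked exception. */
  public static void batchThatTimesOut() throws IOException {
    throw new IOException("hypothetical operation timeout");
  }

  /** Typical caller: handles Exception, deliberately does not catch Error. */
  public static String callAndHandle(boolean tripAssertion) {
    try {
      if (tripAssertion) {
        batchThatTripsAssertion();
      } else {
        batchThatTimesOut();
      }
      return "completed";
    } catch (Exception e) {
      // AssertionError extends Error, not Exception, so it never lands here.
      return "handled: " + e.getMessage();
    }
  }

  /** Returns true if the AssertionError propagates past the caller's handler. */
  public static boolean assertionEscapesHandler() {
    try {
      callAndHandle(true);
      return false;
    } catch (AssertionError e) {
      return true;
    }
  }

  public static void main(String[] args) {
    System.out.println(callAndHandle(false));
    System.out.println("assertion escapes handler: " + assertionEscapesHandler());
  }
}
```

The timeout path is handled by the caller, while the assertion path unwinds through it, which is the behavior the description warns can kill the calling thread.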
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Commented] (HBASE-27781) AssertionError in AsyncRequestFutureImpl when timing out during location resolution
[
https://issues.apache.org/jira/browse/HBASE-27781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17950027#comment-17950027
]
Daniel Roudnitsky commented on HBASE-27781:
---
[~ndimiduk] to clarify, my understanding is that in the context of
groupAndSendMulti, where this bug is, already finished == failed to completion
due to failure to resolve region location. The actions that get passed to
groupAndSendMulti are incomplete when they arrive there for region
location resolution and grouping by server, and the only time we "complete" an
action in groupAndSendMulti is when we fail it due to failure to resolve
the region location (there is also the case of a replica action which could
concurrently succeed, but that's handled properly from what I have observed). So
these actions which we fail due to location resolution are the only ones we
have to take care not to "double fail" and hit this assertion error. In my
patch we don't fail all actions in the entire batch operation; we only fail the
actions in the sub-batch that is being processed in groupAndSendMulti at the
time the operation timeout is exceeded. Any actions that were already
completed successfully are not failed, and their responses are preserved.
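
The behavior described in this comment (fail only the sub-batch being processed, never double fail, keep earlier successes) can be sketched with a per-action result slot that is written at most once. Everything below is an assumption for illustration: SubBatchTimeoutSketch, its methods, and the use of compareAndSet as the "complete once" guard are hypothetical and are not taken from the patch.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReferenceArray;

/**
 * Hypothetical model: each action has one result slot that can be set once.
 * The counter is decremented only when a slot is actually filled.
 */
public class SubBatchTimeoutSketch {
  private final AtomicReferenceArray<Object> results;
  private final AtomicLong actionsInProgress;

  public SubBatchTimeoutSketch(int numActions) {
    this.results = new AtomicReferenceArray<>(numActions);
    this.actionsInProgress = new AtomicLong(numActions);
  }

  /** Records an outcome only if the action has none yet; returns whether it was recorded. */
  public boolean complete(int index, Object outcome) {
    if (results.compareAndSet(index, null, outcome)) {
      actionsInProgress.decrementAndGet();
      return true;
    }
    return false; // already completed: keep the first outcome, do not decrement again
  }

  /** Fails every action of the sub-batch that is still incomplete. */
  public void failSubBatch(int[] subBatch, Throwable cause) {
    for (int index : subBatch) {
      complete(index, cause);
    }
  }

  public Object result(int index) {
    return results.get(index);
  }

  public long remaining() {
    return actionsInProgress.get();
  }

  /** Four actions: 0 succeeded earlier, 1 failed location resolution, 1..3 time out together. */
  public static SubBatchTimeoutSketch demo() {
    SubBatchTimeoutSketch batch = new SubBatchTimeoutSketch(4);
    batch.complete(0, "ok"); // completed successfully outside the sub-batch
    batch.complete(1, new IOException("hypothetical location failure"));
    batch.failSubBatch(new int[] {1, 2, 3}, new IOException("hypothetical operation timeout"));
    return batch;
  }

  public static void main(String[] args) {
    SubBatchTimeoutSketch batch = demo();
    System.out.println("remaining: " + batch.remaining());
    for (int i = 0; i < 4; i++) {
      System.out.println("action " + i + ": " + batch.result(i));
    }
  }
}
```

In this model the successful result of action 0 is untouched, action 1 keeps its original location failure rather than being overwritten by the timeout, and the counter reaches zero without going negative.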
[jira] [Commented] (HBASE-27781) AssertionError in AsyncRequestFutureImpl when timing out during location resolution
[
https://issues.apache.org/jira/browse/HBASE-27781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17950030#comment-17950030
]
Daniel Roudnitsky commented on HBASE-27781:
---
Less verbosely, "fail all actions" really means fail all actions being processed
in groupAndSendMulti at the time the operation timeout is exceeded. None
of those actions have been successfully executed yet, and there is no time
remaining to execute them, so we fail them, but we have to take care not to double
fail any action, or else we hit an assertion error.
[jira] [Commented] (HBASE-27781) AssertionError in AsyncRequestFutureImpl when timing out during location resolution
[
https://issues.apache.org/jira/browse/HBASE-27781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17949959#comment-17949959
]
Nick Dimiduk commented on HBASE-27781:
--
I don't understand why
bq. The problem is, some number of actions may already finished when we get to
this spot.
and
bq. We still want to fail all actions, because none will be executed.
both hold true. If some actions have succeeded, that implies to me that some
have been executed. Thus we don't want to fail all actions, only those that
have not finished.
[jira] [Commented] (HBASE-27781) AssertionError in AsyncRequestFutureImpl when timing out during location resolution
[
https://issues.apache.org/jira/browse/HBASE-27781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17946505#comment-17946505
]
Daniel Roudnitsky commented on HBASE-27781:
---
Hi [~bbeaudreault], sorry to bug you about this again. I have been struggling
to find a reviewer for this issue for some time. Do you by any chance have the
bandwidth to take a look, or can you recommend someone else familiar with the
sync client code whom I could ask? I would love to get this into the 2.6.3
release.
[jira] [Commented] (HBASE-27781) AssertionError in AsyncRequestFutureImpl when timing out during location resolution
[
https://issues.apache.org/jira/browse/HBASE-27781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17945509#comment-17945509
]
Daniel Roudnitsky commented on HBASE-27781:
---
Hoping to get this into 2.6.3 - added target fix version
[jira] [Commented] (HBASE-27781) AssertionError in AsyncRequestFutureImpl when timing out during location resolution
[
https://issues.apache.org/jira/browse/HBASE-27781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17888015#comment-17888015
]
Daniel Roudnitsky commented on HBASE-27781:
---
Hi [~bbeaudreault], is this something you think you'd be able to kindly review?
[jira] [Commented] (HBASE-27781) AssertionError in AsyncRequestFutureImpl when timing out during location resolution
[
https://issues.apache.org/jira/browse/HBASE-27781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17885930#comment-17885930
]
Daniel Roudnitsky commented on HBASE-27781:
---
My first proposed solution did not work consistently when the
AsyncRequestFutureImpl has null results or replica actions, so I withdrew
subtask HBASE-28771. I have put up a new commit with a simpler approach: keep
track of the actions failed locally in groupAndSendMultiAction as we loop over
currentActions, and avoid failing an already locally failed action a second
time if the operation timeout is exceeded.
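The idea in the comment above, remembering which actions were already failed locally so the timeout path fails each action at most once, can be sketched as follows. This is a simplified illustration, not the code from the PR: {{LocalFailureTrackingSketch}}, {{groupAndSend}}, and the boolean list standing in for location lookups are all hypothetical.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of "track locally failed actions"; not the HBase code.
final class LocalFailureTrackingSketch {
  private final AtomicLong actionsInProgress;

  LocalFailureTrackingSketch(int totalActions) {
    this.actionsInProgress = new AtomicLong(totalActions);
  }

  // Marks one action as failed and enforces the non-negative invariant.
  private void failAction() {
    long remaining = actionsInProgress.decrementAndGet();
    if (remaining < 0) {
      throw new IllegalStateException("actionsInProgress went negative: " + remaining);
    }
  }

  // locationResolves.get(i) == false models a failed location lookup for
  // action i. timeoutAtIndex models the loop iteration at which the operation
  // timeout is found to be exceeded (-1 for "never").
  // Returns the number of actions still in progress afterwards.
  long groupAndSend(List<Boolean> locationResolves, int timeoutAtIndex) {
    Set<Integer> failedLocally = new HashSet<>();
    for (int i = 0; i < locationResolves.size(); i++) {
      if (i == timeoutAtIndex) {
        // Fail every action, but skip the ones already failed in this loop.
        for (int j = 0; j < locationResolves.size(); j++) {
          if (failedLocally.add(j)) {
            failAction();
          }
        }
        break;
      }
      if (!locationResolves.get(i)) {
        failedLocally.add(i);
        failAction();
      }
    }
    return actionsInProgress.get();
  }
}
```

In this sketch, four actions with failed lookups at indexes 0 and 2 and a timeout detected at index 3 end with the counter at exactly 0. Without the set, the timeout branch would decrement four more times and drive the counter to -2.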
[jira] [Commented] (HBASE-27781) AssertionError in AsyncRequestFutureImpl when timing out during location resolution
[
https://issues.apache.org/jira/browse/HBASE-27781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17871850#comment-17871850
]
Daniel Roudnitsky commented on HBASE-27781:
---
Created subtask HBASE-28771 and PR #6143 to add the support for non-replica
actions to AsyncRequestFutureImpl.isActionComplete mentioned in my previous
comment, and created PR [#6144|https://github.com/apache/hbase/pull/6144] for
this issue with the fix described there, which makes use of the
AsyncRequestFutureImpl.isActionComplete change.
[jira] [Commented] (HBASE-27781) AssertionError in AsyncRequestFutureImpl when timing out during location resolution
[
https://issues.apache.org/jira/browse/HBASE-27781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17871724#comment-17871724
]
Daniel Roudnitsky commented on HBASE-27781:
---
The solution I have drafted and tested is to check for action completion when
we loop over [actions to fail|http://example.com/], and to skip setting a
location error for actions that have already completed or failed, since their
result is already set and the action counter already decremented. This avoids
the assertion error caused by the double decrement that is currently possible
in the loop. The isActionComplete method in AsyncRequestFutureImpl, which this
kind of check needs, is only designed to support replica actions. I plan to
open a subtask and separate PR for the small refactor to isActionComplete
needed to support the non-replica actions that we have to account for here.
