[jira] [Commented] (HBASE-28595) Losing exception from scan RPC can lead to partial results

2024-05-15 Thread Csaba Ringhofer (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-28595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846804#comment-17846804
 ] 

Csaba Ringhofer commented on HBASE-28595:
-

About the test scenario:
I would separate closing the connection and the other network issues.

Closed connections appear to be handled differently on the client and the 
server: the server throws an exception and closes the scanner when it detects 
that the connection is closed, while the client treats the error as retriable 
and retries without starting a new scanner. This makes the issue easier to 
reproduce by closing the connection.

The "exception/results are lost due to network issues" case is more of a 
hypothetical issue: I don't have logs that show it actually happening. 
It should be possible, but it needs very specific timing.

I used the reproduction scenario with closing connections by [~MikaelSmith] to 
investigate the issue and verify the fix.

> Losing exception from scan RPC can lead to partial results
> --
>
> Key: HBASE-28595
> URL: https://issues.apache.org/jira/browse/HBASE-28595
> Project: HBase
> Issue Type: Bug
> Components: Client, regionserver, Scanners
> Reporter: Csaba Ringhofer
> Assignee: Csaba Ringhofer
> Priority: Critical
> Labels: pull-request-available
>
> This was discovered in Apache Impala using HBase 2.2 based branch hbase 
> client and server. It is not clear yet whether other branches are also 
> affected.
> The issue happens if the server side of the scan throws an exception and 
> closes the scanner, but at the same time, the client gets an rpc connection 
> closed error and doesn't process the exception sent by the server. Client 
> then thinks it got a network error, which leads to retrying the RPC instead 
> of opening a new scanner. But then when the client retry reaches the server, 
> the server returns an empty ScanResponse instead of an error, leading to 
> closing the scanner on client side without returning any error.
> A few pointers to critical parts:
> region server:
> 1st call throws exception leading to closing (but not deleting) scanner:
> [https://github.com/apache/hbase/blob/0c8607a35008b7dca15e9daaec41ec362d159d67/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/RSRpcServices.java#L3539]
> 2nd call (retry of 1st) returns empty results:
> [https://github.com/apache/hbase/blob/0c8607a35008b7dca15e9daaec41ec362d159d67/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/RSRpcServices.java#L3403]
> client:
> some exceptions are handled as non-retriable at RPC level and are only 
> handled through opening a new scanner:
> [https://github.com/apache/hbase/blob/0c8607a35008b7dca15e9daaec41ec362d159d67/hbase-client/src/main/java/org/apache/hadoop/hbase/client/ScannerCallable.java#L214]
> [https://github.com/apache/hbase/blob/0c8607a35008b7dca15e9daaec41ec362d159d67/hbase-client/src/main/java/org/apache/hadoop/hbase/client/ClientScanner.java#L367]
> This mechanism in the client only works if it gets the exception from the 
> server. If there are connection issues during the RPC then the client won't 
> really know the state of the server.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HBASE-28595) Losing exception from scan RPC can lead to partial results

2024-05-15 Thread Csaba Ringhofer (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-28595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846799#comment-17846799
 ] 

Csaba Ringhofer commented on HBASE-28595:
-

>a simple fix would be that, only record a closed scanner when it is closed 
>because of exhausted
I was thinking about the possibility of this error in the exhausted case.
1. The last scan RPC arrives at the server; rows are returned with 
more_results_in_region=false.
2. The response is lost due to network issues and the scan RPC is retried.
3. As the scanner was closed on the server side in step 1, an empty result is 
returned.
4. The client never receives the rows returned in step 1 and finishes without 
an error.

In general, repeat protection seems to be missing for the last scan RPC, as 
next_call_seq can no longer be used on the server side.
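
The sequence above can be sketched as a small toy model. This is only an illustration under stated assumptions: ToyScanner, Response and the method names are hypothetical and are not HBase classes, and the assumption that a closed scanner answers with an empty response is taken from the description in this thread. The sketch shows that a repeated call is rejected while the scanner is open, but goes unnoticed once the last batch has closed it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Toy model only: these are not HBase classes. */
class LastRpcRetryDemo {

    /** Hypothetical stand-in for a scan response. */
    static final class Response {
        final List<String> rows;
        final boolean moreResultsInRegion;

        Response(List<String> rows, boolean moreResultsInRegion) {
            this.rows = rows;
            this.moreResultsInRegion = moreResultsInRegion;
        }
    }

    /** Toy server-side scanner that closes itself once the region is exhausted. */
    static final class ToyScanner {
        private final List<String> rows;
        private int position = 0;
        private long nextCallSeq = 0;
        private boolean closed = false;

        ToyScanner(List<String> rows) {
            this.rows = rows;
        }

        Response next(long callSeq, int batch) {
            if (closed) {
                // The sequence check is skipped for a closed scanner, so a
                // repeated call cannot be told apart from a fresh one.
                return new Response(Collections.<String>emptyList(), false);
            }
            if (callSeq != nextCallSeq) {
                throw new IllegalStateException(
                    "out of order call: expected " + nextCallSeq + ", got " + callSeq);
            }
            nextCallSeq++;
            int end = Math.min(position + batch, rows.size());
            List<String> out = new ArrayList<>(rows.subList(position, end));
            position = end;
            boolean more = position < rows.size();
            if (!more) {
                closed = true; // step 1: last batch served, scanner closed
            }
            return new Response(out, more);
        }
    }

    /** Mid-scan: the response is lost and the same call is repeated. */
    static boolean midScanRetryIsRejected() {
        ToyScanner scanner = new ToyScanner(Arrays.asList("r1", "r2", "r3", "r4"));
        scanner.next(0, 2); // response assumed lost on the wire
        try {
            scanner.next(0, 2); // retry with the same sequence number
            return false;
        } catch (IllegalStateException expected) {
            return true;
        }
    }

    /** Last RPC: the response is lost and the retry returns nothing. */
    static List<String> rowsSeenAfterLastRpcRetry() {
        ToyScanner scanner = new ToyScanner(Arrays.asList("r1", "r2"));
        scanner.next(0, 2); // steps 1 and 2: rows served, response lost
        Response retry = scanner.next(0, 2); // step 3: empty result, no error
        return retry.rows; // step 4: the client never sees r1 and r2
    }
}
```

In this model the mid-scan retry fails loudly, while the retry of the last call returns an empty list, which a client cannot distinguish from a region with no remaining rows.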

The client side fix also helps in the above case. I agree that this should be 
mainly solved on server side, but I am not sure how to do it without causing 
side effects with older clients. My HBase experience is very limited and mainly 
focuses on the client side of branch-2.2.

[~zhangduo] what do you think of adding a fix on both the client and the 
server side? This would ensure that the problem can't come up even if the 
client or the server is old in some setup.





[jira] [Commented] (HBASE-28595) Losing exception from scan RPC can lead to partial results

2024-05-15 Thread Csaba Ringhofer (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-28595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846696#comment-17846696
 ] 

Csaba Ringhofer commented on HBASE-28595:
-

The related classes have already been removed on master (HBASE-21723), so I 
opened a PR with the client-side fix against branch-2:
https://github.com/apache/hbase/pull/5904




[jira] [Comment Edited] (HBASE-28595) Losing exception from scan RPC can lead to partial results

2024-05-15 Thread Csaba Ringhofer (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-28595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846668#comment-17846668
 ] 

Csaba Ringhofer edited comment on HBASE-28595 at 5/15/24 3:23 PM:
--

Thanks [~zhangduo] for the quick response!

Uploaded a patch based on branch-2.2 that fixes the issue on the client side:
https://github.com/csringhofer/hbase/commit/1090407bb2c78aaa839b56a5bbcd8c7ddbbd489f

Generally I think that a server side fix would be cleaner (ensuring that after 
exceptions all later scan calls on the same scanner also return an exception). 
But I was worried about compatibility issues and the client side fix seemed the 
safest.
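
As an illustration of the server-side idea mentioned above (after an exception, all later scan calls on the same scanner also return an exception), here is a minimal toy sketch. It is not the patch linked above and not HBase code: ToyServer, State and the method names are hypothetical, and the compatibility concerns with older clients are not modeled.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model only: these are not HBase classes. */
class StickyFailureDemo {

    enum State { OPEN, FAILED }

    /** Toy server that remembers which scanners failed. */
    static final class ToyServer {
        private final Map<Long, State> scanners = new HashMap<>();
        private long nextId = 1;

        long open() {
            long id = nextId++;
            scanners.put(id, State.OPEN);
            return id;
        }

        /** Simulates one scan call; when fail is true the call throws. */
        String scan(long id, boolean fail) {
            State state = scanners.get(id);
            if (state == null) {
                throw new IllegalArgumentException("unknown scanner " + id);
            }
            if (state == State.FAILED) {
                // The failure is sticky: a retry gets an error, not an empty result.
                throw new IllegalStateException("scanner " + id + " failed earlier");
            }
            if (fail) {
                scanners.put(id, State.FAILED);
                throw new IllegalStateException("scan failed on scanner " + id);
            }
            return "rows";
        }
    }

    /** Returns true if a retry after a lost exception is answered with an error. */
    static boolean retryAfterFailureThrows() {
        ToyServer server = new ToyServer();
        long id = server.open();
        try {
            server.scan(id, true);
        } catch (IllegalStateException lost) {
            // Assume this exception never reaches the client.
        }
        try {
            server.scan(id, false); // the retry of the same call
            return false;
        } catch (IllegalStateException expected) {
            return true;
        }
    }
}
```

With a sticky failure state, a client that missed the first exception still gets an error on the retry and can fall back to opening a new scanner.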


was (Author: csringhofer):
Thanks [~zhangduo] for the quick response!

Uploaded a patch based on branch-2.2 that fixes the issue on the client side:
https://github.com/csringhofer/hbase/commit/c24c2eeb138cbcd15164a2890aaf408057584c66

Generally I think that a server side fix would be cleaner (ensuring that after 
exceptions all later scan calls on the same scanner also return an exception). 
But I was worried about compatibility issues and the client side fix seemed the 
safest.



[jira] [Commented] (HBASE-28595) Losing exception from scan RPC can lead to partial results

2024-05-15 Thread Csaba Ringhofer (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-28595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846668#comment-17846668
 ] 

Csaba Ringhofer commented on HBASE-28595:
-

Thanks [~zhangduo] for the quick response!

Uploaded a patch based on branch-2.2 that fixes the issue on the client side:
https://github.com/csringhofer/hbase/commit/c24c2eeb138cbcd15164a2890aaf408057584c66

Generally I think that a server side fix would be cleaner (ensuring that after 
exceptions all later scan calls on the same scanner also return an exception). 
But I was worried about compatibility issues and the client side fix seemed the 
safest.



[jira] [Created] (HBASE-28595) Loosing exception from scan RPC can lead to partial results

2024-05-15 Thread Csaba Ringhofer (Jira)
Csaba Ringhofer created HBASE-28595:
---

 Summary: Loosing exception from scan RPC can lead to partial results
 Key: HBASE-28595
 URL: https://issues.apache.org/jira/browse/HBASE-28595
 Project: HBase
 Issue Type: Bug
 Components: Client, regionserver, Scanners
 Reporter: Csaba Ringhofer


This was discovered in Apache Impala, using an HBase client and server built 
from a branch based on HBase 2.2. It is not yet clear whether other branches 
are also affected.

The issue happens if the server side of the scan throws an exception and closes 
the scanner, but the client doesn't receive the exact exception and treats the 
failure as a network error, which leads to retrying the RPC instead of opening 
a new scanner. When the RPC is retried, the server returns an empty 
ScanResponse instead of an error, so the scanner is closed on the client side 
without any error being returned.

A few pointers to critical parts:
region server:
1st call throws exception leading to closing (but not deleting) scanner:
https://github.com/apache/hbase/blob/0c8607a35008b7dca15e9daaec41ec362d159d67/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/RSRpcServices.java#L3539
2nd call (retry of 1st) returns empty results:
https://github.com/apache/hbase/blob/0c8607a35008b7dca15e9daaec41ec362d159d67/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/RSRpcServices.java#L3403

client:
some exceptions are handled as non-retriable at RPC level and are only handled 
through opening a new scanner:
https://github.com/apache/hbase/blob/0c8607a35008b7dca15e9daaec41ec362d159d67/hbase-client/src/main/java/org/apache/hadoop/hbase/client/ScannerCallable.java#L214
https://github.com/apache/hbase/blob/0c8607a35008b7dca15e9daaec41ec362d159d67/hbase-client/src/main/java/org/apache/hadoop/hbase/client/ClientScanner.java#L367

This mechanism in the client only works if it gets the exception from the 
server. If there are connection issues during the RPC then the client won't 
really know the state of the server.
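
The failure sequence described above can be reproduced in a toy model. This is a sketch under stated assumptions, not HBase code: ToyServer, Response and scanAll are hypothetical names, the table has four rows served in batches of two, and the lost exception is modeled by the client catching it and retrying as if it were a network error.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Toy model only: these are not HBase classes. */
class LostExceptionDemo {

    /** Hypothetical stand-in for a ScanResponse. */
    static final class Response {
        final List<String> rows;
        final boolean moreResults;

        Response(List<String> rows, boolean moreResults) {
            this.rows = rows;
            this.moreResults = moreResults;
        }
    }

    /** Toy region server with a single scanner over four rows. */
    static final class ToyServer {
        private final List<String> table = Arrays.asList("r1", "r2", "r3", "r4");
        private int position = 0;
        private boolean closed = false;
        private boolean failNext = false;

        void failNextScan() {
            failNext = true;
        }

        Response scan(int batch) {
            if (closed) {
                // 2nd call (retry of 1st): empty result instead of an error.
                return new Response(Collections.<String>emptyList(), false);
            }
            if (failNext) {
                failNext = false;
                closed = true; // 1st call: exception closes the scanner
                throw new IllegalStateException("server-side scan failure");
            }
            int end = Math.min(position + batch, table.size());
            List<String> rows = new ArrayList<>(table.subList(position, end));
            position = end;
            return new Response(rows, position < table.size());
        }
    }

    /** Client loop that retries the same scanner when a call fails. */
    static List<String> scanAll(ToyServer server, boolean failSecondCall) {
        List<String> out = new ArrayList<>();
        int calls = 0;
        while (true) {
            calls++;
            if (failSecondCall && calls == 2) {
                server.failNextScan();
            }
            Response response;
            try {
                response = server.scan(2);
            } catch (IllegalStateException seenAsNetworkError) {
                // The client only sees a connection error, so it retries the
                // RPC on the same scanner instead of opening a new one.
                response = server.scan(2);
            }
            out.addAll(response.rows);
            if (!response.moreResults) {
                return out; // scan considered complete, no error raised
            }
        }
    }
}
```

Without the injected failure the loop returns all four rows; with it, the loop returns only the first two and raises no error, which is the partial result described in this issue.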





[jira] [Updated] (HBASE-28595) Losing exception from scan RPC can lead to partial results

2024-05-15 Thread Csaba Ringhofer (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Csaba Ringhofer updated HBASE-28595:

Summary: Losing exception from scan RPC can lead to partial results  (was: 
Loosing exception from scan RPC can lead to partial results)
