[ https://issues.apache.org/jira/browse/HBASE-30226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eungsop Yoo reassigned HBASE-30226:
-----------------------------------

    Assignee: Eungsop Yoo

> Reverse FuzzyRowFilter can stop making progress when the reverse seek hint is equal to the current row
> ------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-30226
>                 URL: https://issues.apache.org/jira/browse/HBASE-30226
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 2.5.12
>            Reporter: Eungsop Yoo
>            Assignee: Eungsop Yoo
>            Priority: Major
>              Labels: pull-request-available
>
> Observed on HBase 2.5.12. This likely affects all versions after 2.5.11 that include the reverse FuzzyRowFilter hint adjustment from HBASE-28634. A reverse Scan with FuzzyRowFilter can leave a RegionServer scan handler spinning in the RUNNABLE state inside FuzzyRowFilter / RowTracker. In the observed case, the scan queue was empty, but the active scan handlers remained CPU-bound.
> Hot thread stacks repeatedly showed:
> {noformat}
>   FuzzyRowFilter.getNextForFuzzyRule
>   FuzzyRowFilter$RowTracker.updateWith
>   FuzzyRowFilter$RowTracker.updateTracker
>   FuzzyRowFilter.getNextCellHint
>   UserScanQueryMatcher.getNextKeyHint
>   StoreScanner.next
>   RSRpcServices.scan
>   {noformat}
> The issue appears when a reverse seek hint does not sort strictly before the current row. RowTracker can then keep revisiting the same candidate row.
> Consecutive non-matching rows alone do not cause this. The problematic case is when the hint computed for one non-matching row points to the next existing non-matching row, and evaluating that row produces a hint equal to the row itself.
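
The progress condition described above can be written as a small check: a reverse scan moves toward smaller row keys, so a seek hint only makes progress if it sorts strictly before the current row. This is a minimal sketch, not HBase code; the class and method names are hypothetical, and it assumes HBase's unsigned lexicographic row ordering, which `java.util.Arrays.compareUnsigned` (Java 9+) reproduces for byte arrays.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of HBase: models the progress condition for reverse seek hints.
final class ReverseHintCheck {
  private ReverseHintCheck() {}

  // Row keys compare as unsigned bytes, lexicographically; a proper prefix sorts first.
  static int compareRows(byte[] a, byte[] b) {
    return Arrays.compareUnsigned(a, b);
  }

  // In a reverse scan a hint only makes progress if it sorts strictly before the current row.
  static boolean makesReverseProgress(byte[] currentRow, byte[] hint) {
    return compareRows(hint, currentRow) < 0;
  }

  static byte[] row(String s) {
    return s.getBytes(StandardCharsets.US_ASCII);
  }
}
```

With the rows from this report, the hint abb computed for abc passes the check, while the hint abb computed again for abb fails it, which is the no-progress case.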
> h3. Case explanation
> The important part is where the reverse hint sends the scanner.
> Does not reproduce with the original HBASE-28634 example:
>  * Filter: 1114??
>  * Table order in reverse: 111777 non-match, 111611 non-match, 111511 non-match, 111446 match
>  * Actual scan flow: 111777 -> hint 1115 -> seek -> 111446 match
>  * The scanner skips the intermediate non-matching rows, so RowTracker does not enter the bad poll/add state.
> Reproduces:
>  * Filter: a?a
>  * Table order in reverse: abc non-match, abb non-match, aaa match
>  * Actual scan flow: abc -> hint abb -> seek -> abb
>  * Then abb -> hint abb again. RowTracker polls abb and adds abb again, so updateTracker can loop.
> Does not reproduce with only two rows:
>  * Filter: a?a
>  * Table order in reverse: abb non-match, aaa match
>  * Actual scan flow: abb -> aaa match
>  * The scanner reaches a matching row before the bad RowTracker state is triggered.
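
The poll/add cycle in the three cases above can be modeled with a small priority queue. This is a simplified, hypothetical model of the pattern the report describes, not the actual `FuzzyRowFilter$RowTracker` code. It assumes, consistent with the cases above, that the first row a tracker sees only seeds the queue with one hint, and that later rows run a loop that polls the head while it is not strictly before the current row and re-adds the recomputed hint. The per-row hints (abc -> abb, abb -> abb) are taken from the case explanation rather than computed from the fuzzy rule, and an iteration cap stands in for the unbounded loop.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.function.Function;

// Simplified, hypothetical model of a reverse-scan row tracker; not HBase code.
final class ToyReverseRowTracker {
  private static final int MAX_ITERATIONS = 1_000; // stands in for "loops forever"

  // Reverse scan: the candidate with the largest row key is reached first, so it is the head.
  private final PriorityQueue<byte[]> nextRows =
      new PriorityQueue<>((a, b) -> Arrays.compareUnsigned(b, a));
  private final Function<byte[], byte[]> hintFor;
  private boolean initialized = false;

  ToyReverseRowTracker(Function<byte[], byte[]> hintFor) {
    this.hintFor = hintFor;
  }

  private void addHint(byte[] currentRow) {
    byte[] hint = hintFor.apply(currentRow);
    if (hint != null) {
      nextRows.add(hint);
    }
  }

  // Returns the number of poll/add iterations, or -1 if the cap was hit.
  int update(byte[] currentRow) {
    if (!initialized) {
      addHint(currentRow); // first row: seed once, no loop
      initialized = true;
      return 0;
    }
    int iterations = 0;
    // Replace candidates that are not strictly before the current row.
    while (!nextRows.isEmpty() && Arrays.compareUnsigned(nextRows.peek(), currentRow) >= 0) {
      if (++iterations > MAX_ITERATIONS) {
        return -1; // same-row hint: poll abb, add abb, poll abb, ...
      }
      nextRows.poll();
      addHint(currentRow);
    }
    return iterations;
  }

  static byte[] row(String s) {
    return s.getBytes(StandardCharsets.US_ASCII);
  }

  // Hints as stated in the case explanation for the a?a rule.
  static Function<byte[], byte[]> reportedHints() {
    Map<String, String> hints = Map.of("abc", "abb", "abb", "abb");
    return r -> {
      String h = hints.get(new String(r, StandardCharsets.US_ASCII));
      return h == null ? null : row(h);
    };
  }
}
```

In this model the three-row case seeds abb while on abc, then hits the cap on abb; the two-row case sees abb first, only seeds the queue, and the scan moves on to aaa.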
> h3. Reproduction with hbase shell
> {noformat}
>   import java.util.Arrays
>   import org.apache.hadoop.hbase.filter.FuzzyRowFilter
>   import org.apache.hadoop.hbase.util.Bytes
>   import org.apache.hadoop.hbase.util.Pair
>   create 'FUZZY_REVERSE_REPRO', 'f'
>   put 'FUZZY_REVERSE_REPRO', 'aaa', 'f:q1', 'v'
>   put 'FUZZY_REVERSE_REPRO', 'abb', 'f:q1', 'v'
>   put 'FUZZY_REVERSE_REPRO', 'abc', 'f:q1', 'v'
>   scan 'FUZZY_REVERSE_REPRO', {
>     REVERSED => true,
>     FILTER => FuzzyRowFilter.new(Arrays.asList(
>       Pair.new(Bytes.toBytesBinary('aaa'), Bytes.toBytesBinary('\x00\x01\x00'))
>     ))
>   }
>   {noformat}
> The fuzzy rule is a?a: the mask \x00\x01\x00 fixes the first and last bytes and leaves the middle byte free. Row aaa matches; rows abc and abb do not. In reverse order, abc is seen before abb, and the hint for abc points to abb. When abb is evaluated next, its reverse hint is abb itself, so RowTracker can enter the poll/add loop even though each row has only one cell.
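
The match decision for the rule above can be illustrated with a plain-Java model of the mask convention the shell example uses: a mask byte of 0 means the key byte is fixed and must match, and 1 means any byte is accepted. This is a hypothetical helper for illustration, not HBase's matching code; it only handles rows of the same length as the fuzzy key, which covers the three rows here.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of the fuzzy key/mask convention; not HBase's implementation.
final class FuzzyMaskModel {
  private FuzzyMaskModel() {}

  // mask[i] == 0: row[i] must equal key[i]; mask[i] == 1: any byte is accepted.
  // Only rows of the same length as the key are handled in this sketch.
  static boolean matches(byte[] row, byte[] key, byte[] mask) {
    if (row.length != key.length || mask.length != key.length) {
      return false;
    }
    for (int i = 0; i < key.length; i++) {
      if (mask[i] == 0 && row[i] != key[i]) {
        return false;
      }
    }
    return true;
  }

  static byte[] row(String s) {
    return s.getBytes(StandardCharsets.US_ASCII);
  }
}
```

With key aaa and mask {0, 1, 0}, aaa matches while abb and abc fail on their last byte.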
> h3. Expected
> The reverse scan skips abb, returns aaa, and finishes.
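
The expected behavior can be checked against a toy reverse scan that refuses to follow a hint which does not move backward. This is a hypothetical model, not HBase code and not necessarily the fix in the linked pull request: any hint that does not sort strictly before the current row is ignored and the scan steps to the previous row instead. It assumes a reverse seek lands on the largest existing row less than or equal to the hint, which is consistent with the scan flows in the case explanation; rows are Strings, whose ordering matches unsigned byte order for the ASCII keys used here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;
import java.util.function.Predicate;

// Hypothetical toy model of a reverse scan that follows seek hints; not HBase code.
final class GuardedReverseScan {
  private GuardedReverseScan() {}

  static List<String> scan(TreeSet<String> rows, Predicate<String> matches, Map<String, String> hints) {
    List<String> results = new ArrayList<>();
    String current = rows.isEmpty() ? null : rows.last(); // reverse scan starts at the largest row
    while (current != null) {
      if (matches.test(current)) {
        results.add(current);
        current = rows.lower(current);
        continue;
      }
      String hint = hints.get(current);
      if (hint != null && hint.compareTo(current) < 0) {
        current = rows.floor(hint); // assumed reverse seek: largest row <= hint
      } else {
        current = rows.lower(current); // guard: a hint that does not move back is ignored
      }
    }
    return results;
  }
}
```

Every step either moves to a strictly smaller row or ends the scan, so the loop terminates. Fed the report's rows and hints, it returns only aaa for the three-row and two-row tables, and 111446 for the HBASE-28634 example.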
> h3. Actual
> On a vulnerable version, the scan does not return and the client eventually 
> times out. For example, with the default 60 second RPC timeout, hbase shell
> reports:
> {noformat}
>   java.net.SocketTimeoutException: callTimeout=60000, callDuration=60142: Call to address=<regionserver>:16020 failed on local exception:
>   org.apache.hadoop.hbase.ipc.CallTimeoutException: Call[id=33,methodName=Scan], waitTime=60010ms, rpcTimeout=60000ms
>   ERROR: Call[id=33,methodName=Scan], waitTime=60010ms, rpcTimeout=60000ms
>   {noformat}
> After the client times out or disconnects, the RegionServer scan handler can keep spinning in the FuzzyRowFilter / RowTracker stack.
> h3. Cleanup
> {noformat}
>   disable 'FUZZY_REVERSE_REPRO'
>   drop 'FUZZY_REVERSE_REPRO'
>   {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
