[ 
https://issues.apache.org/jira/browse/HBASE-29633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Emil Kleszcz updated HBASE-29633:
---------------------------------
       Attachment: HBASE-29633.patch
    Fix Version/s: 2.5.10
           Status: Patch Available  (was: Open)

Attached a patch that ignores this error so that scans and deleteall work 
against the corrupted hbase:meta row.
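
For illustration, the invariant that the patch relaxes can be sketched roughly as 
follows. This is a simplified stand-in for the real ScanWildcardColumnTracker, 
not HBase code; class and method names here are illustrative:

{code:java}
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Simplified stand-in (illustrative, not the real HBase class): within one
// row, qualifiers must arrive in non-decreasing byte order. A corrupted row
// that violates this trips the strict check and aborts the scan.
public class WildcardTrackerSketch {
  public enum MatchCode { INCLUDE, SKIP }

  private byte[] previous = null;

  // Current behavior: throw when a qualifier sorts before the previous one.
  public MatchCode checkColumnStrict(byte[] qualifier) throws IOException {
    if (previous != null && compare(qualifier, previous) < 0) {
      throw new IOException(
          "ran into a column actually smaller than the previous column: "
              + new String(qualifier, StandardCharsets.UTF_8));
    }
    previous = qualifier;
    return MatchCode.INCLUDE;
  }

  // Patched behavior: skip the out-of-order cell and keep scanning.
  public MatchCode checkColumnLenient(byte[] qualifier) {
    if (previous != null && compare(qualifier, previous) < 0) {
      return MatchCode.SKIP;
    }
    previous = qualifier;
    return MatchCode.INCLUDE;
  }

  // Unsigned lexicographic comparison, in the spirit of HBase's Bytes.compareTo.
  private static int compare(byte[] a, byte[] b) {
    for (int i = 0; i < Math.min(a.length, b.length); i++) {
      int d = (a[i] & 0xff) - (b[i] & 0xff);
      if (d != 0) {
        return d;
      }
    }
    return a.length - b.length;
  }
}
{code}

With the strict variant, a single out-of-order qualifier aborts the whole row 
(surfacing to clients as ScannerResetException); the lenient variant drops only 
the offending cell and lets the scan continue.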

> Non-monotonic hbase:meta cell versions trigger ScanWildcardColumnTracker 
> exception and block scans
> --------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-29633
>                 URL: https://issues.apache.org/jira/browse/HBASE-29633
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 2.5.10
>            Reporter: Emil Kleszcz
>            Priority: Critical
>             Fix For: 2.5.10
>
>         Attachments: HBASE-29633.patch
>
>
> *Context*
> Clusters can end up with _hbase:meta_ rows that contain multiple 
> _info:regioninfo_ versions with out-of-order timestamps.
> This can happen when corrupted edits are inserted manually or during rare 
> replication/compaction edge cases.
> When the _hbase:meta_ scanner encounters such a row, region servers throw an 
> exception while iterating qualifiers, causing client scanners to close 
> unexpectedly.
> This issue was discovered while investigating a related problem, reported in 
> HBASE-29554 (https://issues.apache.org/jira/browse/HBASE-29554), where 
> corrupted entries with single-comma row keys could not be removed from 
> _hbase:meta_ due to incorrect row-key validation.
> *Problem*
> When scanning {_}hbase:meta{_}, the RegionServer throws:
> {code:java}
> java.io.IOException: ScanWildcardColumnTracker.checkColumn ran into a column 
> actually smaller than the previous column: regioninfo
> {code}
> This surfaces to clients as:
> {code:java}
> org.apache.hadoop.hbase.exceptions.ScannerResetException:
> Scanner is closed on the server-side
> {code}
> Standard {_}flush{_}, {_}major_compact{_}, and _catalogjanitor_run_ do not 
> repair the row.
> Attempts to delete or rewrite the row using the Java client fail.
> Full error message:
> {code:java}
> Caused by: 
> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.exceptions.ScannerResetException):
>  org.apache.hadoop.hbase.exceptions.ScannerResetException: Scanner is closed 
> on the server-side
>         at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:3757)
>         at 
> org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:45006)
>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:415)
>         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
>         at org.apache.hadoop.hbase.ipc.RpcHandler.run(RpcHandler.java:102)
>         at org.apache.hadoop.hbase.ipc.RpcHandler.run(RpcHandler.java:82)
> Caused by: java.io.IOException: ScanWildcardColumnTracker.checkColumn ran 
> into a column actually smaller than the previous column: regioninfo
>         at 
> org.apache.hadoop.hbase.regionserver.querymatcher.ScanWildcardColumnTracker.checkVersions(ScanWildcardColumnTracker.java:121)
>         at 
> org.apache.hadoop.hbase.regionserver.querymatcher.UserScanQueryMatcher.matchColumn(UserScanQueryMatcher.java:141)
>         at 
> org.apache.hadoop.hbase.regionserver.querymatcher.NormalUserScanQueryMatcher.match(NormalUserScanQueryMatcher.java:80)
>         at 
> org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:624)
>         at 
> org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:145)
>         at 
> org.apache.hadoop.hbase.regionserver.RegionScannerImpl.populateResult(RegionScannerImpl.java:342)
>         at 
> org.apache.hadoop.hbase.regionserver.RegionScannerImpl.nextInternal(RegionScannerImpl.java:513)
>         at 
> org.apache.hadoop.hbase.regionserver.RegionScannerImpl.nextRaw(RegionScannerImpl.java:278)
>         at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:3402)
>         at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:3668)
>         ... 5 more
> {code}
> *Steps to Reproduce:*
> 1. Insert an _hbase:meta_ row with several _info:regioninfo_ versions where 
> one has a lower timestamp than earlier entries.
> 2. Flush and major-compact _hbase:meta_.
> 3. Run a scan with {{RAW => true, VERSIONS => 10}} on the row.
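> The steps above could be attempted from the HBase shell roughly as follows 
> (row key and values are illustrative; note that ordinary puts are sorted by 
> the memstore, so faithfully reproducing the on-disk corruption may require 
> injecting malformed edits as in the original incident):
> {code}
> # 1. Insert info:regioninfo versions with out-of-order timestamps
> put 'hbase:meta', 'corrupt-row', 'info:regioninfo', 'v2', 2000
> put 'hbase:meta', 'corrupt-row', 'info:regioninfo', 'v1', 1000
> # 2. Persist the cells
> flush 'hbase:meta'
> major_compact 'hbase:meta'
> # 3. Scan the row raw with all versions
> scan 'hbase:meta', {RAW => true, VERSIONS => 10, STARTROW => 'corrupt-row', LIMIT => 1}
> {code}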
> *Observed Behavior*
>  - Any HBase client scan over _hbase:meta_ fails once it reaches the 
> corrupted row.
>  - RegionServer logs show _ScanWildcardColumnTracker.checkColumn_ exception.
>  - Compaction does not reorder or drop the offending KeyValues.
> *Attempted Workaround / Patch*
> As a temporary measure, I replaced the exception in 
> {_}ScanWildcardColumnTracker.checkColumn{_}:
> {code:java}
> // Old
> throw new IOException("ScanWildcardColumnTracker.checkColumn ran into a column actually "
>   + "smaller than the previous column: "
>   + Bytes.toStringBinary(CellUtil.cloneQualifier(cell)));
> // New (workaround)
> return ScanQueryMatcher.MatchCode.SKIP;
> {code}
> This allows the scan to skip the offending cell and continue.
> This is not a proper fix: it merely unblocks scanning and should be reviewed 
> for side effects, since the check exists in part to catch real data 
> corruption such as that reported in 
> https://issues.apache.org/jira/browse/HBASE-1715
> *Risks*
>  - Corrupted rows cannot be cleaned or compacted with the current tooling.
>  - HBCK2's fixMeta does not help.
>  - Without a safeguard, future accidental edits or replication bugs could 
> cause production outages.
> *Request / Suggested Action*
>  - Investigate why out-of-order qualifier timestamps cause 
> _ScanWildcardColumnTracker_ to throw instead of skipping.
>  - Provide an administrative tool or automatic repair path to rewrite or drop 
> the broken versions.
>  - Consider stricter checks to prevent insertion of _hbase:meta_ cells with 
> non-monotonic version ordering.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
