[
https://issues.apache.org/jira/browse/HBASE-29633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Emil Kleszcz updated HBASE-29633:
---------------------------------
Attachment: HBASE-29633.patch
Fix Version/s: 2.5.10
Status: Patch Available (was: Open)
Added a patch that ignores this error, allowing scans and deleteall to work
against the corrupted hbase:meta.
> Non-monotonic hbase:meta cell versions trigger ScanWildcardColumnTracker
> exception and block scans
> --------------------------------------------------------------------------------------------------
>
> Key: HBASE-29633
> URL: https://issues.apache.org/jira/browse/HBASE-29633
> Project: HBase
> Issue Type: Bug
> Affects Versions: 2.5.10
> Reporter: Emil Kleszcz
> Priority: Critical
> Fix For: 2.5.10
>
> Attachments: HBASE-29633.patch
>
>
> *Context*
> Clusters can end up with _hbase:meta_ rows that contain multiple
> _info:regioninfo_ versions with out-of-order timestamps.
> This can happen when corrupted edits are inserted manually or during rare
> replication/compaction edge cases.
> When the _hbase:meta_ scanner encounters such a row, region servers throw an
> exception while iterating qualifiers, causing client scanners to close
> unexpectedly.
> This issue was discovered while investigating another problem with corrupted
> meta entries that could not be removed from _hbase:meta_ due to incorrect
> validation of single-comma row keys, reported in
> https://issues.apache.org/jira/browse/HBASE-29554
> *Problem*
> When scanning {_}hbase:meta{_}, the RegionServer throws:
> {code:java}
> java.io.IOException: ScanWildcardColumnTracker.checkColumn ran into a column
> actually smaller than the previous column: regioninfo
> {code}
> This surfaces to clients as:
> {code:java}
> org.apache.hadoop.hbase.exceptions.ScannerResetException:
> Scanner is closed on the server-side
> {code}
> Standard {_}flush{_}, {_}major_compact{_}, and _catalogjanitor_run_ do not
> repair the row.
> Attempts to delete or rewrite the row using the Java client fail.
> Full error message:
> {code:java}
> Caused by:
> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.exceptions.ScannerResetException):
> org.apache.hadoop.hbase.exceptions.ScannerResetException: Scanner is closed
> on the server-side
> at
> org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:3757)
> at
> org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:45006)
> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:415)
> at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
> at org.apache.hadoop.hbase.ipc.RpcHandler.run(RpcHandler.java:102)
> at org.apache.hadoop.hbase.ipc.RpcHandler.run(RpcHandler.java:82)
> Caused by: java.io.IOException: ScanWildcardColumnTracker.checkColumn ran
> into a column actually smaller than the previous column: regioninfo
> at
> org.apache.hadoop.hbase.regionserver.querymatcher.ScanWildcardColumnTracker.checkVersions(ScanWildcardColumnTracker.java:121)
> at
> org.apache.hadoop.hbase.regionserver.querymatcher.UserScanQueryMatcher.matchColumn(UserScanQueryMatcher.java:141)
> at
> org.apache.hadoop.hbase.regionserver.querymatcher.NormalUserScanQueryMatcher.match(NormalUserScanQueryMatcher.java:80)
> at
> org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:624)
> at
> org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:145)
> at
> org.apache.hadoop.hbase.regionserver.RegionScannerImpl.populateResult(RegionScannerImpl.java:342)
> at
> org.apache.hadoop.hbase.regionserver.RegionScannerImpl.nextInternal(RegionScannerImpl.java:513)
> at
> org.apache.hadoop.hbase.regionserver.RegionScannerImpl.nextRaw(RegionScannerImpl.java:278)
> at
> org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:3402)
> at
> org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:3668)
> ... 5 more
> {code}
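> For reference, a minimal client scan that runs into this failure; a sketch
> assuming a standard 2.5.x client connected to the affected cluster (the class
> name is illustrative):
> {code:java}
> import org.apache.hadoop.hbase.HBaseConfiguration;
> import org.apache.hadoop.hbase.TableName;
> import org.apache.hadoop.hbase.client.Connection;
> import org.apache.hadoop.hbase.client.ConnectionFactory;
> import org.apache.hadoop.hbase.client.Result;
> import org.apache.hadoop.hbase.client.ResultScanner;
> import org.apache.hadoop.hbase.client.Scan;
> import org.apache.hadoop.hbase.client.Table;
>
> public class MetaScanRepro {
>   public static void main(String[] args) throws Exception {
>     try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
>          Table meta = conn.getTable(TableName.META_TABLE_NAME);
>          ResultScanner scanner = meta.getScanner(new Scan())) {
>       // Iteration aborts with the ScannerResetException above once the
>       // server-side scanner reaches the corrupted row.
>       for (Result result : scanner) {
>         System.out.println(result);
>       }
>     }
>   }
> }
> {code}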
> *Steps to Reproduce*
> 1. Insert an _hbase:meta_ row with several _info:regioninfo_ versions where
> one has a lower timestamp than earlier entries (see the Java sketch below).
> 2. Flush and major compact _hbase:meta_.
> 3. Run a scan with RAW => true, VERSIONS => 10 on the row.
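> A minimal sketch of steps 1-2 with the standard Java client. The row key and
> cell values are placeholders; it assumes the corruption can be injected with
> plain puts carrying explicit, decreasing timestamps, as described above:
> {code:java}
> import org.apache.hadoop.hbase.HBaseConfiguration;
> import org.apache.hadoop.hbase.TableName;
> import org.apache.hadoop.hbase.client.Admin;
> import org.apache.hadoop.hbase.client.Connection;
> import org.apache.hadoop.hbase.client.ConnectionFactory;
> import org.apache.hadoop.hbase.client.Put;
> import org.apache.hadoop.hbase.client.Table;
> import org.apache.hadoop.hbase.util.Bytes;
>
> public class MetaCorruptionSetup {
>   public static void main(String[] args) throws Exception {
>     try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
>          Table meta = conn.getTable(TableName.META_TABLE_NAME);
>          Admin admin = conn.getAdmin()) {
>       byte[] row = Bytes.toBytes("testtable,,1700000000000.0123456789abcdef0123456789abcdef.");
>       byte[] info = Bytes.toBytes("info");
>       byte[] regioninfo = Bytes.toBytes("regioninfo");
>       // Step 1: write two versions, the second with a LOWER timestamp.
>       meta.put(new Put(row).addColumn(info, regioninfo, 2000L, Bytes.toBytes("newer")));
>       meta.put(new Put(row).addColumn(info, regioninfo, 1000L, Bytes.toBytes("older")));
>       // Step 2: flush, then request a major compaction of hbase:meta.
>       admin.flush(TableName.META_TABLE_NAME);
>       admin.majorCompact(TableName.META_TABLE_NAME);
>     }
>   }
> }
> {code}
> Step 3 is then the raw shell scan shown above.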
> *Observed Behavior*
> - Any HBase client scan over _hbase:meta_ fails once it reaches the
> corrupted row.
> - RegionServer logs show the _ScanWildcardColumnTracker.checkColumn_ exception.
> - Compaction does not reorder or drop the offending KeyValues.
> *Attempted Workaround / Patch*
> As a temporary measure, I replaced the exception in
> {_}ScanWildcardColumnTracker.checkColumn{_}:
> {code:java}
> // Old
> throw new IOException("ScanWildcardColumnTracker.checkColumn ran into a column actually "
>   + "smaller than the previous column: "
>   + Bytes.toStringBinary(CellUtil.cloneQualifier(cell)));
>
> // New (workaround)
> return ScanQueryMatcher.MatchCode.SKIP;
> {code}
> This allows the scan to skip the offending cell and continue.
> This is not a proper fix; it merely unblocks scanning and should be reviewed
> for side effects, e.g. masking real data corruption of the kind reported in
> https://issues.apache.org/jira/browse/HBASE-1715
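> To illustrate the check the workaround relaxes: the tracker expects cells to
> arrive in ascending qualifier order within a row, and compares each incoming
> qualifier against the last one it processed. A self-contained illustration
> (the qualifier values are just examples of real _hbase:meta_ columns):
> {code:java}
> import org.apache.hadoop.hbase.util.Bytes;
>
> public class ColumnOrderCheckDemo {
>   public static void main(String[] args) {
>     byte[] previous = Bytes.toBytes("seqnumDuringOpen"); // column already tracked
>     byte[] current = Bytes.toBytes("regioninfo");        // cell arriving out of order
>     if (Bytes.compareTo(current, previous) < 0) {
>       // Old behavior: throw IOException("... ran into a column actually
>       // smaller ...") and the server closes the scanner.
>       // Workaround: return MatchCode.SKIP so the scan moves past the cell.
>       System.out.println("Out-of-order column detected");
>     }
>   }
> }
> {code}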
> *Risks*
> - Corrupted rows cannot be cleaned or compacted with the current tooling.
> - HBCK2's fixMeta does not help.
> - Without a safeguard, future accidental edits or replication bugs could
> cause production outages.
> *Request / Suggested Action*
> - Investigate why out-of-order qualifier timestamps cause
> _ScanWildcardColumnTracker_ to throw instead of skipping.
> - Provide an administrative tool or automatic repair path to rewrite or drop
> the broken versions.
> - Consider stricter checks to prevent insertion of _hbase:meta_ cells with
> non-monotonic version ordering (a minimal client-side guard is sketched below).
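> Regarding the last point, a minimal client-side guard along these lines could
> reject non-monotonic writes before they reach _hbase:meta_ (a hypothetical
> helper, not an existing HBase API):
> {code:java}
> import java.io.IOException;
> import org.apache.hadoop.hbase.Cell;
> import org.apache.hadoop.hbase.client.Get;
> import org.apache.hadoop.hbase.client.Put;
> import org.apache.hadoop.hbase.client.Table;
> import org.apache.hadoop.hbase.util.Bytes;
>
> public final class MetaWriteGuard {
>   private MetaWriteGuard() {}
>
>   /** Refuses to write an info:regioninfo version older than the newest stored one. */
>   public static void putRegioninfoGuarded(Table meta, byte[] row, long ts, byte[] value)
>       throws IOException {
>     byte[] info = Bytes.toBytes("info");
>     byte[] regioninfo = Bytes.toBytes("regioninfo");
>     Cell latest = meta.get(new Get(row).addColumn(info, regioninfo))
>         .getColumnLatestCell(info, regioninfo);
>     if (latest != null && ts < latest.getTimestamp()) {
>       throw new IOException("Refusing non-monotonic version: ts=" + ts
>           + " < existing ts=" + latest.getTimestamp());
>     }
>     meta.put(new Put(row).addColumn(info, regioninfo, ts, value));
>   }
> }
> {code}
> Note this get-then-put is not atomic, so a real safeguard would have to live
> server-side in the catalog write path.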
--
This message was sent by Atlassian Jira
(v8.20.10#820010)