I'd like to add a couple of details which I've only recently uncovered:

- The part of the alter which triggers the error is `MIN_VERSIONS`. If I apply just the `VERSIONS` and `TTL` portions (see the sketch below), I don't observe these errors, though that doesn't preserve some behavior I care about.
- The table in question has a fairly large number of column qualifiers. The tables where I mentioned we had previously applied very similar changes had only a small, fixed set of qualifiers.

In principle, I understand that more qualifiers might mean the RS has to do more work to enforce constraints on the number of versions. But I don't understand why that would break things for `MIN_VERSIONS` while being fine for (max) `VERSIONS`, nor why it would surface as "Not seeked" states.
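For concreteness, the two variants look like this in the shell. The first is the full statement quoted below; the second is a paraphrase of the narrowed change that does not trigger the errors:

```
# Triggers the NotSeekedException errors shortly after it completes:
alter 'api_grains', {NAME => 'g', MIN_VERSIONS => 5, VERSIONS => 500, TTL => 7257600},
                    {NAME => 'isg', MIN_VERSIONS => 5, VERSIONS => 500, TTL => 7257600}

# Does not trigger them -- same statement with MIN_VERSIONS left out:
alter 'api_grains', {NAME => 'g', VERSIONS => 500, TTL => 7257600},
                    {NAME => 'isg', VERSIONS => 500, TTL => 7257600}
```

The only difference between the two is the `MIN_VERSIONS` attribute, which is why I believe it, rather than `VERSIONS` or `TTL`, is the trigger.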
On Mon, May 13, 2019 at 1:19 PM Aaron Beppu <[email protected]> wrote:

> Hey HBase users,
>
> I've been struggling with a weird issue. Our team has a table which
> currently has a large number of versions per row, and we're seeking to
> apply a schema change which constrains both the number and the age of the
> versions stored:
>
> ```
> alter 'api_grains', {NAME => 'g', MIN_VERSIONS => 5, VERSIONS => 500, TTL => 7257600}, {NAME => 'isg', MIN_VERSIONS => 5, VERSIONS => 500, TTL => 7257600}
> ```
>
> When I apply this change to a large table on our 5.2.0 (CDH5) cluster, the
> alter seems to be applied across all regions without problems, but almost
> immediately after it finishes, I consistently see the region servers
> surface the following error:
>
> ```
> Unexpected throwable object
> org.apache.hadoop.hbase.io.hfile.AbstractHFileReader$NotSeekedException: Not seeked to a key/value
>   at org.apache.hadoop.hbase.io.hfile.AbstractHFileReader$Scanner.assertSeeked(AbstractHFileReader.java:313)
>   at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$ScannerV2.next(HFileReaderV2.java:878)
>   at org.apache.hadoop.hbase.regionserver.StoreFileScanner.next(StoreFileScanner.java:181)
>   at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:108)
>   at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:588)
>   at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:147)
>   at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.populateResult(HRegion.java:5775)
>   at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:5931)
>   at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:5709)
>   at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.next(HRegion.java:5685)
>   at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.next(HRegion.java:5671)
>   at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:6904)
>   at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:6862)
>   at org.apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:2010)
>   at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:33644)
>   at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2191)
>   at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:112)
>   at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:183)
>   at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:163)
> ```
>
> i.e., the region server seems not to have appropriately set up scanners to
> read its own HFiles. This shows up in the logs of many RSs in the cluster
> and happens continuously, breaking the service which queries this table.
> The issue is reproducible (I've triggered it about 8 times in our preprod
> environments) and is always resolved by restoring a snapshot from before
> the schema change.
>
> During the period when the region servers throw these exceptions, I don't
> see any other indications that HBase is in poor health: there are no
> regions in transition, hbck doesn't report anything interesting, and other
> tables seem unaffected.
>
> Just to confirm that the issue is not actually about the HFiles themselves
> being malformed, I took a snapshot of the table while it was in the
> "broken" state.
> After exporting this snapshot to a different environment, I confirmed
> that, at a minimum, I can run Spark or Hadoop jobs over the files in the
> snapshot without encountering any issues. So I believe the files
> themselves are fine, because they're readable by HFile input formats.
>
> A further source of confusion is that we have recently run extremely
> similar `alter table ...` commands for other tables in the same cluster
> without issue.
>
> If anyone can comment on how the region servers might get into such a
> state (where they don't appropriately initialize and seek an HFile
> reader), or how that state could be related to specific table admin
> operations, please share any insights you may have.
>
> I understand that, given the older version we're running, it may be
> tempting to recommend that we upgrade to 2.1 and report back if our issue
> is unresolved. Please understand that we're running a large cluster which
> supports high-throughput, customer-facing services, and that such a
> migration is a substantial project. If you make such a recommendation,
> please point to a specific issue or bug which has been resolved in more
> recent versions.
>
> Thanks,
> Aaron
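P.S. If anyone wants to repeat the file-level readability check mentioned above without standing up a Spark job, the bundled HFile tool can be pointed at individual store files. A rough sketch, where the bracketed path components are placeholders for a real region directory and store file, and the prefix should match your hbase.rootdir:

```
# Illustrative only -- prints file metadata, stats, and key/values for one store file,
# independently of any region server scanner state.
hbase org.apache.hadoop.hbase.io.hfile.HFile -v -m -p \
  -f hdfs:///hbase/data/default/api_grains/<region>/g/<hfile>
```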
