I'd like to add a couple details which I've only recently uncovered:
- The part of the alter which causes the error is `MIN_VERSIONS`. If I
apply just the `VERSIONS` and `TTL` portions, I don't observe these errors
(though this doesn't preserve some behavior that I care about). See the
sketch after this list for how I split the command.
- The table in question has a somewhat large number of column qualifiers.
The tables where I mentioned we had previously applied very similar changes
had only a small fixed set of qualifiers. In principle, I understand that
this might mean that the RS has to do more work to enforce constraints on
the number of versions. But I don't understand why this would cause things
to break for `MIN_VERSIONS` but be fine for (max) `VERSIONS`, nor do I
understand why that would surface as "Not seeked" states.
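
For concreteness, the split looks roughly like this (same table and column
families as in the original alter quoted below; the second command isolates
the `MIN_VERSIONS` portion purely for illustration):

```
# Applying only the max-versions and TTL portions does NOT trigger the
# "Not seeked" errors:
alter 'api_grains',
  {NAME => 'g',   VERSIONS => 500, TTL => 7257600},
  {NAME => 'isg', VERSIONS => 500, TTL => 7257600}

# Including MIN_VERSIONS (shown here on its own just for illustration) is
# the part that reliably triggers the errors:
alter 'api_grains',
  {NAME => 'g',   MIN_VERSIONS => 5},
  {NAME => 'isg', MIN_VERSIONS => 5}
```

(For context: `MIN_VERSIONS` sets the minimum number of versions to retain
per cell even after the TTL has expired them, which is the retention
behavior I don't want to give up by leaving it out.)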

On Mon, May 13, 2019 at 1:19 PM Aaron Beppu <[email protected]> wrote:

> Hey HBase users,
>
> I've been struggling with a weird issue. Our team has a table which
> currently stores a large number of versions per row, and we're seeking to
> apply a schema change which constrains both the number and the age of the
> versions stored:
> ```
> alter 'api_grains',
>   {NAME => 'g',   MIN_VERSIONS => 5, VERSIONS => 500, TTL => 7257600},
>   {NAME => 'isg', MIN_VERSIONS => 5, VERSIONS => 500, TTL => 7257600}
> ```
> When I attempt to apply this schema change to this (large) table on a
> 5.2.0 (CDH5) cluster, the alter seems to be applied across all regions
> without problems, but almost immediately after it finishes, I
> consistently see the region servers surface the following error.
>
> ```
> Unexpected throwable object
> org.apache.hadoop.hbase.io.hfile.AbstractHFileReader$NotSeekedException: Not seeked to a key/value
>       at org.apache.hadoop.hbase.io.hfile.AbstractHFileReader$Scanner.assertSeeked(AbstractHFileReader.java:313)
>       at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$ScannerV2.next(HFileReaderV2.java:878)
>       at org.apache.hadoop.hbase.regionserver.StoreFileScanner.next(StoreFileScanner.java:181)
>       at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:108)
>       at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:588)
>       at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:147)
>       at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.populateResult(HRegion.java:5775)
>       at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:5931)
>       at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:5709)
>       at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.next(HRegion.java:5685)
>       at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.next(HRegion.java:5671)
>       at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:6904)
>       at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:6862)
>       at org.apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:2010)
>       at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:33644)
>       at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2191)
>       at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:112)
>       at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:183)
>       at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:163)
> ```
>
> That is, the region server seems not to have appropriately set up its
> scanners to read its own HFiles. This error appears in the logs of many
> RSs in the cluster and happens continuously, which breaks the service
> that queries this table. The issue is reproducible (I've triggered it
> about 8 times in our preprod environments), and it is always resolved by
> restoring a snapshot from before the schema change (roughly the sequence
> sketched below).
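>
> To be explicit about that recovery step, it amounts to something like the
> following (the snapshot name here is hypothetical; restoring requires the
> table to be disabled first):
>
> ```
> # snapshot taken before the alter was applied (name is illustrative)
> disable 'api_grains'
> restore_snapshot 'api_grains_pre_alter'
> enable 'api_grains'
> ```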
>
> During the period where region servers throw these exceptions, I don't see
> any other indications that HBase is in poor health. There are no regions in
> transition, hbck doesn't report anything interesting, and other tables seem
> unaffected.
>
> Just to confirm that the issue is not actually about the HFiles themselves
> being malformed, I took a snapshot of the table while it was in the
> "broken" state. After exporting it to a different environment, I confirmed
> that, at a minimum, Spark and Hadoop jobs can read the files in the
> snapshot without encountering any issues (roughly as sketched below). So I
> believe that the files themselves are fine, because they're readable by
> HFile input formats.
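>
> For reference, this kind of snapshot/export can be done with the standard
> tooling, roughly as follows (the snapshot name and the destination URI
> below are placeholders, and the export step is just one way to do it):
>
> ```
> # from the HBase shell, while the table was in the "broken" state
> snapshot 'api_grains', 'api_grains_broken_state'
>
> # then exported with the stock ExportSnapshot tool (run from bash):
> #   hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
> #     -snapshot api_grains_broken_state \
> #     -copy-to hdfs://<other-cluster>/hbase
> ```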
>
> A further source of confusion is that we have recently done extremely
> similar `alter table ...` commands for other tables in the same cluster,
> without issue.
>
> If anyone can comment on how the region servers might get into such a
> state (where they don't appropriately initialize and seek an HFile
> reader), or how that state could be related to specific table admin
> operations, please share any insights you may have.
>
> I understand that, given the older version we're running, it may be
> tempting to recommend that we upgrade to 2.1 and report back if our issue
> is unresolved. Please understand that we're running a large cluster which
> supports high-throughput, customer-facing services, and that such a
> migration is a substantial project. If you do make that recommendation,
> please point to a specific issue or bug which has been resolved in more
> recent versions.
>
> Thanks,
> Aaron
>
