[ https://issues.apache.org/jira/browse/HBASE-27484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wellington Chevreuil resolved HBASE-27484.
------------------------------------------
    Resolution: Fixed

Merged to master, branch-2, branch-2.5 and branch-2.4. Thanks for reviewing it, [~psomogyi] !

> FNFE on StoreFileScanner after a flush followed by a compaction
> ---------------------------------------------------------------
>
>                 Key: HBASE-27484
>                 URL: https://issues.apache.org/jira/browse/HBASE-27484
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Wellington Chevreuil
>            Assignee: Wellington Chevreuil
>            Priority: Major
>
> One of our customers was running SyncTable from a 1.2 based cluster, where the SyncTable map tasks opened scanners against a 2.4 based cluster to compare the two clusters. A few of the map tasks failed with a DoNotRetryException, caused by a FileNotFoundException that blew all the way up to the client:
> {noformat}
> Error: org.apache.hadoop.hbase.DoNotRetryIOException: org.apache.hadoop.hbase.DoNotRetryIOException: java.io.FileNotFoundException: open s3a://xxxxxxxxx-xxxxxxxx-xxxxxxx/hbase/data/default/xxxxxx/4c53da8c2ab9b7d7a0d6046ef3bb701c/0/e2c58350e4e54c21b0f713a4c271b8c0 at 7225 on s3a://xxxxxxxxx-xxxxxxxx-xxxxxxx/hbase/data/default/xxxxxx/4c53da8c2ab9b7d7a0d6046ef3bb701c/0/e2c58350e4e54c21b0f713a4c271b8c0: com.amazonaws.services.s3.model.AmazonS3Exception: The specified key does not exist.
> (Service: Amazon S3; Status Code: 404; Error Code: NoSuchKey; Request ID: KBRNC67WZGCS4SCF; S3 Extended Request ID: wWEJm8tlFSj/8g+xFpmD1vgWzT/n7HzBcOFAZ8ayIqKMsDZGN/d2kEhdusPLhMd540h+QAPP1xw=; Proxy: null), S3 Extended Request ID: wWEJm8tlFSj/8g+xFpmD1vgWzT/n7HzBcOFAZ8ayIqKMsDZGN/d2kEhdusPLhMd540h+QAPP1xw=:NoSuchKey
> 	at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:3712)
> 	at org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:45819)
> 	at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:392)
> 	at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:140)
> 	at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:359)
> 	at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:339)
> Caused by: java.io.FileNotFoundException: open s3a://xxxxxxxxx-xxxxxxxx-xxxxxxx/hbase/data/default/xxxxxx/4c53da8c2ab9b7d7a0d6046ef3bb701c/0/e2c58350e4e54c21b0f713a4c271b8c0 at 7225 on s3a://xxxxxxxxx-xxxxxxxx-xxxxxxx/hbase/data/default/xxxxxx/4c53da8c2ab9b7d7a0d6046ef3bb701c/0/e2c58350e4e54c21b0f713a4c271b8c0: com.amazonaws.services.s3.model.AmazonS3Exception: The specified key does not exist. (Service: Amazon S3; Status Code: 404; Error Code: NoSuchKey; Request ID: KBRNC67WZGCS4SCF; S3 Extended Request ID: wWEJm8tlFSj/8g+xFpmD1vgWzT/n7HzBcOFAZ8ayIqKMsDZGN/d2kEhdusPLhMd540h+QAPP1xw=; Proxy: null), S3 Extended Request ID: wWEJm8tlFSj/8g+xFpmD1vgWzT/n7HzBcOFAZ8ayIqKMsDZGN/d2kEhdusPLhMd540h+QAPP1xw=:NoSuchKey
> 	...
> 	at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.seekTo(HFileReaderImpl.java:632)
> 	at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:315)
> 	at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:216)
> 	at org.apache.hadoop.hbase.regionserver.StoreScanner.seekScanners(StoreScanner.java:417)
> 	at org.apache.hadoop.hbase.regionserver.StoreScanner.reopenAfterFlush(StoreScanner.java:1018)
> 	at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:552)
> 	at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:155)
> 	at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.populateResult(HRegion.java:7399)
> 	at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:7567)
> 	at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:7331)
> 	at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:3373)
> {noformat}
> The RS logs show that the above file had been created only moments earlier by a memstore flush, with a compaction triggered shortly after:
> {noformat}
> 2022-11-11 22:16:50,322 INFO org.apache.hadoop.hbase.regionserver.DefaultStoreFlusher: Flushed memstore data size=208.15 KB at sequenceid=4949703 (bloomFilter=false), to=s3a://xxxxxxxxx-xxxxxxxx-xxxxxxx/hbase/data/default/xxxxxx/4c53da8c2ab9b7d7a0d6046ef3bb701c/0/e2c58350e4e54c21b0f713a4c271b8c0
> 2022-11-11 22:16:50,513 INFO org.apache.hadoop.hbase.regionserver.HStore: Added s3a://xxxxxxxxx-xxxxxxxx-xxxxxxx/hbase/data/default/xxxxxx/4c53da8c2ab9b7d7a0d6046ef3bb701c/0/e2c58350e4e54c21b0f713a4c271b8c0, entries=951, sequenceid=4949703, filesize=26.2 K
> ...
> 2022-11-11 22:16:50,791 INFO org.apache.hadoop.hbase.regionserver.HRegion: Starting compaction of 4c53da8c2ab9b7d7a0d6046ef3bb701c/0 in xxxxxx,IT001E90506702\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1618275339031.4c53da8c2ab9b7d7a0d6046ef3bb701c.
> 2022-11-11 22:16:50,791 INFO org.apache.hadoop.hbase.regionserver.HStore: Starting compaction of [s3a://xxxxxxxxx-xxxxxxxx-xxxxxxx/hbase/data/default/xxxxxx/4c53da8c2ab9b7d7a0d6046ef3bb701c/0/9a865e5e3ba74ab6b9597fe0e295fd8d, s3a://xxxxxxxxx-xxxxxxxx-xxxxxxx/hbase/data/default/xxxxxx/4c53da8c2ab9b7d7a0d6046ef3bb701c/0/adbe118d8c76482d8faed3170437be0b, s3a://xxxxxxxxx-xxxxxxxx-xxxxxxx/hbase/data/default/xxxxxx/4c53da8c2ab9b7d7a0d6046ef3bb701c/0/e2c58350e4e54c21b0f713a4c271b8c0] into tmpdir=s3a://xxxxxxxxx-xxxxxxxx-xxxxxxx/hbase/data/default/xxxxxx/4c53da8c2ab9b7d7a0d6046ef3bb701c/.tmp, totalSize=27.1 M
> ...
> 2022-11-11 22:16:54,469 INFO org.apache.hadoop.hbase.regionserver.HStore: Completed compaction of 3 file(s) in 4c53da8c2ab9b7d7a0d6046ef3bb701c/0 of 4c53da8c2ab9b7d7a0d6046ef3bb701c into bc44bfd06a03429fb331ab4dc10dbb8d(size=27.1 M), total size for store is 4.9 G. This selection was in queue for 0sec, and took 3sec to execute.
> {noformat}
> I believe this is an unlucky scenario where the compaction discharger archived the compacted-away files while the StoreFileScanner was being created but had not yet updated the refCounter on the file reader. We couldn't reproduce this on a real cluster, but I could emulate it with a UT, artificially inducing a delay in StoreFileScanner creation for scans that are not part of a compaction. One possible fix is to update the reader refCounter as soon as we get the files for the StoreFileScanner instances we are creating.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
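The race and the proposed fix can be sketched outside HBase with a minimal ref-counted file model. All class and method names below (`RefCountedFile`, `openScanners`, `tryArchive`) are hypothetical stand-ins for illustration, not the real HBase API; the point is only the ordering: pin the reader refCount immediately when the files are obtained, before any other scanner setup, so the discharger cannot archive a file out from under a scanner that is still being constructed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a store file whose reader is reference counted.
class RefCountedFile {
    final String name;
    final AtomicInteger refCount = new AtomicInteger(0);
    volatile boolean archived = false;

    RefCountedFile(String name) { this.name = name; }

    // Stand-in for the compaction discharger: it may only archive
    // compacted-away files that no scanner currently references.
    boolean tryArchive() {
        if (refCount.get() == 0) {
            archived = true;
            return true;
        }
        return false;
    }
}

public class ScannerRefCountSketch {
    // The buggy ordering snapshots the file list first and bumps the
    // refCount only later in scanner construction, leaving a window in
    // which the discharger sees refCount == 0 and archives the file
    // (the FNFE window). The fixed ordering pins each file up front:
    static List<RefCountedFile> openScanners(List<RefCountedFile> files) {
        List<RefCountedFile> pinned = new ArrayList<>();
        for (RefCountedFile f : files) {
            f.refCount.incrementAndGet(); // pin before any other setup work
            pinned.add(f);
        }
        return pinned; // ...remaining scanner construction happens after this
    }

    public static void main(String[] args) {
        RefCountedFile flushed = new RefCountedFile("e2c58350e4e54c21b0f713a4c271b8c0");
        List<RefCountedFile> pinned = openScanners(List.of(flushed));
        // Discharger runs after the scanner pinned the file: archiving must fail.
        System.out.println("archived=" + flushed.tryArchive()); // archived=false
        // Scanner closes and releases its references; archiving now succeeds.
        pinned.forEach(f -> f.refCount.decrementAndGet());
        System.out.println("archived=" + flushed.tryArchive()); // archived=true
    }
}
```

With the increment moved to the moment the file list is obtained, the only remaining safe interleaving for the discharger is before the scanner gets the file at all, in which case the scanner simply never sees the archived file.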