Wellington Chevreuil created HBASE-27484:
--------------------------------------------

             Summary: FNFE on StoreFileScanner after a flush followed by a compaction
                 Key: HBASE-27484
                 URL: https://issues.apache.org/jira/browse/HBASE-27484
             Project: HBase
          Issue Type: Bug
            Reporter: Wellington Chevreuil
            Assignee: Wellington Chevreuil


One of our customers was running SyncTable from a 1.2 based cluster, with the SyncTable map tasks opening scanners against a 2.4 based cluster to compare the two clusters. A few of the map tasks failed with a DoNotRetryIOException caused by a FileNotFoundException that blew all the way up to the client:
{noformat}
Error: org.apache.hadoop.hbase.DoNotRetryIOException: org.apache.hadoop.hbase.DoNotRetryIOException: java.io.FileNotFoundException: open s3a://xxxxxxxxx-xxxxxxxx-xxxxxxx/hbase/data/default/xxxxxx/4c53da8c2ab9b7d7a0d6046ef3bb701c/0/e2c58350e4e54c21b0f713a4c271b8c0 at 7225 on s3a://xxxxxxxxx-xxxxxxxx-xxxxxxx/hbase/data/default/xxxxxx/4c53da8c2ab9b7d7a0d6046ef3bb701c/0/e2c58350e4e54c21b0f713a4c271b8c0: com.amazonaws.services.s3.model.AmazonS3Exception: The specified key does not exist. (Service: Amazon S3; Status Code: 404; Error Code: NoSuchKey; Request ID: KBRNC67WZGCS4SCF; S3 Extended Request ID: wWEJm8tlFSj/8g+xFpmD1vgWzT/n7HzBcOFAZ8ayIqKMsDZGN/d2kEhdusPLhMd540h+QAPP1xw=; Proxy: null), S3 Extended Request ID: wWEJm8tlFSj/8g+xFpmD1vgWzT/n7HzBcOFAZ8ayIqKMsDZGN/d2kEhdusPLhMd540h+QAPP1xw=:NoSuchKey
        at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:3712)
        at org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:45819)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:392)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:140)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:359)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:339)
Caused by: java.io.FileNotFoundException: open s3a://xxxxxxxxx-xxxxxxxx-xxxxxxx/hbase/data/default/xxxxxx/4c53da8c2ab9b7d7a0d6046ef3bb701c/0/e2c58350e4e54c21b0f713a4c271b8c0 at 7225 on s3a://xxxxxxxxx-xxxxxxxx-xxxxxxx/hbase/data/default/xxxxxx/4c53da8c2ab9b7d7a0d6046ef3bb701c/0/e2c58350e4e54c21b0f713a4c271b8c0: com.amazonaws.services.s3.model.AmazonS3Exception: The specified key does not exist. (Service: Amazon S3; Status Code: 404; Error Code: NoSuchKey; Request ID: KBRNC67WZGCS4SCF; S3 Extended Request ID: wWEJm8tlFSj/8g+xFpmD1vgWzT/n7HzBcOFAZ8ayIqKMsDZGN/d2kEhdusPLhMd540h+QAPP1xw=; Proxy: null), S3 Extended Request ID: wWEJm8tlFSj/8g+xFpmD1vgWzT/n7HzBcOFAZ8ayIqKMsDZGN/d2kEhdusPLhMd540h+QAPP1xw=:NoSuchKey
...
        at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.seekTo(HFileReaderImpl.java:632)
        at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:315)
        at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:216)
        at org.apache.hadoop.hbase.regionserver.StoreScanner.seekScanners(StoreScanner.java:417)
        at org.apache.hadoop.hbase.regionserver.StoreScanner.reopenAfterFlush(StoreScanner.java:1018)
        at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:552)
        at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:155)
        at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.populateResult(HRegion.java:7399)
        at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:7567)
        at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:7331)
        at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:3373)
{noformat}

We can see in the RS logs that the file above had recently been created as the outcome of a memstore flush, with a compaction being triggered shortly after:
{noformat}
2022-11-11 22:16:50,322 INFO org.apache.hadoop.hbase.regionserver.DefaultStoreFlusher: Flushed memstore data size=208.15 KB at sequenceid=4949703 (bloomFilter=false), to=s3a://xxxxxxxxx-xxxxxxxx-xxxxxxx/hbase/data/default/xxxxxx/4c53da8c2ab9b7d7a0d6046ef3bb701c/0/e2c58350e4e54c21b0f713a4c271b8c0
2022-11-11 22:16:50,513 INFO org.apache.hadoop.hbase.regionserver.HStore: Added s3a://xxxxxxxxx-xxxxxxxx-xxxxxxx/hbase/data/default/xxxxxx/4c53da8c2ab9b7d7a0d6046ef3bb701c/0/e2c58350e4e54c21b0f713a4c271b8c0, entries=951, sequenceid=4949703, filesize=26.2 K
...
2022-11-11 22:16:50,791 INFO org.apache.hadoop.hbase.regionserver.HRegion: Starting compaction of 4c53da8c2ab9b7d7a0d6046ef3bb701c/0 in xxxxxx,IT001E90506702\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1618275339031.4c53da8c2ab9b7d7a0d6046ef3bb701c.
2022-11-11 22:16:50,791 INFO org.apache.hadoop.hbase.regionserver.HStore: Starting compaction of [s3a://xxxxxxxxx-xxxxxxxx-xxxxxxx/hbase/data/default/xxxxxx/4c53da8c2ab9b7d7a0d6046ef3bb701c/0/9a865e5e3ba74ab6b9597fe0e295fd8d, s3a://xxxxxxxxx-xxxxxxxx-xxxxxxx/hbase/data/default/xxxxxx/4c53da8c2ab9b7d7a0d6046ef3bb701c/0/adbe118d8c76482d8faed3170437be0b, s3a://xxxxxxxxx-xxxxxxxx-xxxxxxx/hbase/data/default/xxxxxx/4c53da8c2ab9b7d7a0d6046ef3bb701c/0/e2c58350e4e54c21b0f713a4c271b8c0] into tmpdir=s3a://xxxxxxxxx-xxxxxxxx-xxxxxxx/hbase/data/default/xxxxxx/4c53da8c2ab9b7d7a0d6046ef3bb701c/.tmp, totalSize=27.1 M
...
2022-11-11 22:16:54,469 INFO org.apache.hadoop.hbase.regionserver.HStore: Completed compaction of 3 file(s) in 4c53da8c2ab9b7d7a0d6046ef3bb701c/0 of 4c53da8c2ab9b7d7a0d6046ef3bb701c into bc44bfd06a03429fb331ab4dc10dbb8d(size=27.1 M), total size for store is 4.9 G. This selection was in queue for 0sec, and took 3sec to execute.
{noformat}

I believe this is an unlucky scenario where the compaction discharger moved the compacted-away files while the StoreFileScanner was being created but had not yet updated the refCounter on the file reader. We couldn't reproduce this on a real cluster, but I could emulate it with a UT by artificially inducing a delay in the StoreFileScanner creation for scans that are not part of compactions. One possible fix is to update the reader refCounter as soon as we get the files for the StoreFileScanner instances we are creating, as sketched below.
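
For illustration only, below is a minimal, self-contained Java sketch of the suspected race. None of the names are HBase's real classes or methods (FakeStoreFile, discharge, createScanner etc. are made-up stand-ins for the store file reader refCount, the compacted-files discharger chore and the scanner creation path); it only shows why bumping the refCount as soon as the file is obtained closes the window in which the discharger can archive a compacted-away file:
{code:java}
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Simplified stand-ins (NOT real HBase classes). Demonstrates the timing issue:
 * if the reader refCount is only bumped after some scanner-creation work, the
 * discharger can archive the compacted-away file in between.
 */
public class RefCountRaceSketch {

  static class FakeStoreFile {
    final String name;
    final AtomicInteger refCount = new AtomicInteger(0); // readers pinning this file
    volatile boolean compactedAway = false;              // compaction produced a replacement
    volatile boolean archived = false;                   // discharger moved the file away

    FakeStoreFile(String name) { this.name = name; }

    void open() {
      if (archived) {
        // In the real system this surfaces as a FileNotFoundException from the filesystem.
        throw new IllegalStateException("simulated FNFE opening " + name);
      }
    }
  }

  /** Models the discharger: archive compacted-away files that no reader is pinning. */
  static void discharge(List<FakeStoreFile> files) {
    for (FakeStoreFile f : files) {
      if (f.compactedAway && f.refCount.get() == 0) {
        f.archived = true;
      }
    }
  }

  /** Scanner creation over a file already obtained from the store's file list. */
  static String createScanner(FakeStoreFile f, boolean bumpRefEarly,
      CountDownLatch fileObtained, CountDownLatch dischargerDone) throws InterruptedException {
    if (bumpRefEarly) {
      f.refCount.incrementAndGet(); // proposed fix: pin the reader as soon as we get the file
    }
    fileObtained.countDown();       // compaction + discharger are free to run now
    dischargerDone.await();         // the "unlucky" delay during scanner creation
    if (!bumpRefEarly) {
      f.refCount.incrementAndGet(); // current behaviour: pinned too late
    }
    try {
      f.open();
      return "scan ok";
    } catch (IllegalStateException e) {
      return "scan failed: " + e.getMessage();
    } finally {
      f.refCount.decrementAndGet();
    }
  }

  static String run(boolean bumpRefEarly) throws InterruptedException {
    List<FakeStoreFile> store = new CopyOnWriteArrayList<>();
    FakeStoreFile flushed = new FakeStoreFile("e2c58350e4e54c21b0f713a4c271b8c0");
    store.add(flushed);

    CountDownLatch fileObtained = new CountDownLatch(1);
    CountDownLatch dischargerDone = new CountDownLatch(1);
    final String[] result = new String[1];

    Thread scanner = new Thread(() -> {
      try {
        result[0] = createScanner(flushed, bumpRefEarly, fileObtained, dischargerDone);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    scanner.start();

    fileObtained.await();         // the scanner already has the flushed file in scope
    flushed.compactedAway = true; // the compaction above rewrote it into a new file
    discharge(store);             // discharger archives it unless a reader pins it
    dischargerDone.countDown();
    scanner.join();
    return result[0];
  }

  public static void main(String[] args) throws Exception {
    System.out.println("refCount bumped late : " + run(false)); // simulated FNFE
    System.out.println("refCount bumped early: " + run(true));  // scan ok
  }
}
{code}
With the refCount bumped late the discharger sees refCount == 0 and archives the file, so the later open fails just like the FNFE above; with the refCount bumped as soon as the file is obtained, the discharger skips it and the scan completes.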


