[ https://issues.apache.org/jira/browse/HBASE-27484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wellington Chevreuil resolved HBASE-27484.
------------------------------------------
    Resolution: Fixed

Merged to master, branch-2, branch-2.5 and branch-2.4. Thanks for reviewing it, [~psomogyi] !

> FNFE on StoreFileScanner after a flush followed by a compaction
> ---------------------------------------------------------------
>
>                 Key: HBASE-27484
>                 URL: https://issues.apache.org/jira/browse/HBASE-27484
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Wellington Chevreuil
>            Assignee: Wellington Chevreuil
>            Priority: Major
>
> One of our customers was running SyncTable from a 1.2 based cluster, where the SyncTable map tasks opened scanners against a 2.4 based cluster to compare the two clusters. A few of the map tasks failed with a DoNotRetryException, caused by a FileNotFoundException that blew all the way up to the client:
> {noformat}
> Error: org.apache.hadoop.hbase.DoNotRetryIOException: org.apache.hadoop.hbase.DoNotRetryIOException: java.io.FileNotFoundException: open s3a://xxxxxxxxx-xxxxxxxx-xxxxxxx/hbase/data/default/xxxxxx/4c53da8c2ab9b7d7a0d6046ef3bb701c/0/e2c58350e4e54c21b0f713a4c271b8c0 at 7225 on s3a://xxxxxxxxx-xxxxxxxx-xxxxxxx/hbase/data/default/xxxxxx/4c53da8c2ab9b7d7a0d6046ef3bb701c/0/e2c58350e4e54c21b0f713a4c271b8c0: com.amazonaws.services.s3.model.AmazonS3Exception: The specified key does not exist.
> (Service: Amazon S3; Status Code: 404; Error Code: NoSuchKey; Request ID: KBRNC67WZGCS4SCF; S3 Extended Request ID: wWEJm8tlFSj/8g+xFpmD1vgWzT/n7HzBcOFAZ8ayIqKMsDZGN/d2kEhdusPLhMd540h+QAPP1xw=; Proxy: null), S3 Extended Request ID: wWEJm8tlFSj/8g+xFpmD1vgWzT/n7HzBcOFAZ8ayIqKMsDZGN/d2kEhdusPLhMd540h+QAPP1xw=:NoSuchKey
> 	at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:3712)
> 	at org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:45819)
> 	at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:392)
> 	at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:140)
> 	at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:359)
> 	at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:339)
> Caused by: java.io.FileNotFoundException: open s3a://xxxxxxxxx-xxxxxxxx-xxxxxxx/hbase/data/default/xxxxxx/4c53da8c2ab9b7d7a0d6046ef3bb701c/0/e2c58350e4e54c21b0f713a4c271b8c0 at 7225 on s3a://xxxxxxxxx-xxxxxxxx-xxxxxxx/hbase/data/default/xxxxxx/4c53da8c2ab9b7d7a0d6046ef3bb701c/0/e2c58350e4e54c21b0f713a4c271b8c0: com.amazonaws.services.s3.model.AmazonS3Exception: The specified key does not exist. (Service: Amazon S3; Status Code: 404; Error Code: NoSuchKey; Request ID: KBRNC67WZGCS4SCF; S3 Extended Request ID: wWEJm8tlFSj/8g+xFpmD1vgWzT/n7HzBcOFAZ8ayIqKMsDZGN/d2kEhdusPLhMd540h+QAPP1xw=; Proxy: null), S3 Extended Request ID: wWEJm8tlFSj/8g+xFpmD1vgWzT/n7HzBcOFAZ8ayIqKMsDZGN/d2kEhdusPLhMd540h+QAPP1xw=:NoSuchKey
> 	...
> 	at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.seekTo(HFileReaderImpl.java:632)
> 	at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:315)
> 	at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:216)
> 	at org.apache.hadoop.hbase.regionserver.StoreScanner.seekScanners(StoreScanner.java:417)
> 	at org.apache.hadoop.hbase.regionserver.StoreScanner.reopenAfterFlush(StoreScanner.java:1018)
> 	at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:552)
> 	at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:155)
> 	at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.populateResult(HRegion.java:7399)
> 	at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:7567)
> 	at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:7331)
> 	at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:3373)
> {noformat}
> The RS logs show that the above file had been created only moments earlier by a memstore flush, with a compaction triggered shortly after:
> {noformat}
> 2022-11-11 22:16:50,322 INFO org.apache.hadoop.hbase.regionserver.DefaultStoreFlusher: Flushed memstore data size=208.15 KB at sequenceid=4949703 (bloomFilter=false), to=s3a://xxxxxxxxx-xxxxxxxx-xxxxxxx/hbase/data/default/xxxxxx/4c53da8c2ab9b7d7a0d6046ef3bb701c/0/e2c58350e4e54c21b0f713a4c271b8c0
> 2022-11-11 22:16:50,513 INFO org.apache.hadoop.hbase.regionserver.HStore: Added s3a://xxxxxxxxx-xxxxxxxx-xxxxxxx/hbase/data/default/xxxxxx/4c53da8c2ab9b7d7a0d6046ef3bb701c/0/e2c58350e4e54c21b0f713a4c271b8c0, entries=951, sequenceid=4949703, filesize=26.2 K
> ...
> 2022-11-11 22:16:50,791 INFO org.apache.hadoop.hbase.regionserver.HRegion: Starting compaction of 4c53da8c2ab9b7d7a0d6046ef3bb701c/0 in xxxxxx,IT001E90506702\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1618275339031.4c53da8c2ab9b7d7a0d6046ef3bb701c.
> 2022-11-11 22:16:50,791 INFO org.apache.hadoop.hbase.regionserver.HStore: Starting compaction of [s3a://xxxxxxxxx-xxxxxxxx-xxxxxxx/hbase/data/default/xxxxxx/4c53da8c2ab9b7d7a0d6046ef3bb701c/0/9a865e5e3ba74ab6b9597fe0e295fd8d, s3a://xxxxxxxxx-xxxxxxxx-xxxxxxx/hbase/data/default/xxxxxx/4c53da8c2ab9b7d7a0d6046ef3bb701c/0/adbe118d8c76482d8faed3170437be0b, s3a://xxxxxxxxx-xxxxxxxx-xxxxxxx/hbase/data/default/xxxxxx/4c53da8c2ab9b7d7a0d6046ef3bb701c/0/e2c58350e4e54c21b0f713a4c271b8c0] into tmpdir=s3a://xxxxxxxxx-xxxxxxxx-xxxxxxx/hbase/data/default/xxxxxx/4c53da8c2ab9b7d7a0d6046ef3bb701c/.tmp, totalSize=27.1 M
> ...
> 2022-11-11 22:16:54,469 INFO org.apache.hadoop.hbase.regionserver.HStore: Completed compaction of 3 file(s) in 4c53da8c2ab9b7d7a0d6046ef3bb701c/0 of 4c53da8c2ab9b7d7a0d6046ef3bb701c into bc44bfd06a03429fb331ab4dc10dbb8d(size=27.1 M), total size for store is 4.9 G. This selection was in queue for 0sec, and took 3sec to execute.
> {noformat}
> I believe this is an unlucky scenario where the compaction discharger archived the compacted-away files while the StoreFileScanner was being created but had not yet updated the refCounter on the file reader. We couldn't reproduce this on a real cluster, but I could emulate it with a UT, artificially inducing a delay in StoreFileScanner creation for scans that are not part of a compaction. One possible fix is to update the reader refCounter as soon as we get the files for the StoreFileScanner instances we are creating.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
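The race and the proposed fix can be sketched outside HBase with a minimal ref-counted file model. All class and method names below (`RefCountedFile`, `openScanners`, `tryArchive`) are hypothetical stand-ins for illustration, not the real HBase API; the point is only the ordering: pin the reader refCount immediately when the files are obtained, before any other scanner setup, so the discharger cannot archive a file out from under a scanner that is still being constructed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a store file whose reader is reference counted.
class RefCountedFile {
    final String name;
    final AtomicInteger refCount = new AtomicInteger(0);
    volatile boolean archived = false;

    RefCountedFile(String name) { this.name = name; }

    // Stand-in for the compaction discharger: it may only archive
    // compacted-away files that no scanner currently references.
    boolean tryArchive() {
        if (refCount.get() == 0) {
            archived = true;
            return true;
        }
        return false;
    }
}

public class ScannerRefCountSketch {
    // The buggy ordering snapshots the file list first and bumps the
    // refCount only later in scanner construction, leaving a window in
    // which the discharger sees refCount == 0 and archives the file
    // (the FNFE window). The fixed ordering pins each file up front:
    static List<RefCountedFile> openScanners(List<RefCountedFile> files) {
        List<RefCountedFile> pinned = new ArrayList<>();
        for (RefCountedFile f : files) {
            f.refCount.incrementAndGet(); // pin before any other setup work
            pinned.add(f);
        }
        return pinned; // ...remaining scanner construction happens after this
    }

    public static void main(String[] args) {
        RefCountedFile flushed = new RefCountedFile("e2c58350e4e54c21b0f713a4c271b8c0");
        List<RefCountedFile> pinned = openScanners(List.of(flushed));
        // Discharger runs after the scanner pinned the file: archiving must fail.
        System.out.println("archived=" + flushed.tryArchive()); // archived=false
        // Scanner closes and releases its references; archiving now succeeds.
        pinned.forEach(f -> f.refCount.decrementAndGet());
        System.out.println("archived=" + flushed.tryArchive()); // archived=true
    }
}
```

With the increment moved to the moment the file list is obtained, the only remaining safe interleaving for the discharger is before the scanner gets the file at all, in which case the scanner simply never sees the archived file.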