Wellington Chevreuil created HBASE-28884:
--------------------------------------------
Summary: SFT's BrokenStoreFileCleaner may cause data loss
Key: HBASE-28884
URL: https://issues.apache.org/jira/browse/HBASE-28884
Project: HBase
Issue Type: Bug
Reporter: Wellington Chevreuil
Assignee: Wellington Chevreuil
With BrokenStoreFileCleaner enabled, one of our customers ran into a data loss
situation, probably caused by a race condition: a region was being moved off
the region server while BrokenStoreFileCleaner was checking that region's files
for deletion eligibility. We have seen that the file was deleted by that region
server at around the same time the region was closed on it. I believe a race
condition during region close is possible here:
1) In BrokenStoreFileCleaner, for each region online on the given RS, we get
the list of files in the store dirs, then iterate through it [1];
2) For each file listed, we perform several checks, including this one [2],
which checks whether the file is "active".
The problem is that if the region owning the file being checked is closed
between steps 1) and 2), then by the time we check whether the file is active
in [2], the store may already have been closed as part of the region close, so
the check would treat the file as deletable.
One simple solution is to check if the store's region is still open before
proceeding with deleting the file.
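
To make the race and the proposed guard concrete, here is a minimal,
self-contained sketch. It does not use the real HBase classes: Store, Region,
and selectDeletable are hypothetical stand-ins, and the sketch assumes that
closing a store drops its in-memory list of active files, as described above.
It only illustrates the idea of re-checking that the region is open before
treating a file as deletable.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class CleanerRaceSketch {

  /** Hypothetical stand-in for a store: tracks the files it currently serves. */
  static final class Store {
    private final Set<String> activeFiles = new HashSet<>();

    Store(String... files) {
      activeFiles.addAll(Arrays.asList(files));
    }

    boolean isActive(String file) {
      return activeFiles.contains(file);
    }

    /** Assumption: closing the store forgets which files were active. */
    void close() {
      activeFiles.clear();
    }
  }

  /** Hypothetical stand-in for a region with a single store. */
  static final class Region {
    private volatile boolean open = true;
    final Store store;

    Region(Store store) {
      this.store = store;
    }

    boolean isOpen() {
      return open;
    }

    void close() {
      store.close();
      open = false;
    }
  }

  /**
   * Returns the listed files the cleaner would consider deletable.
   * With checkRegionOpen set, the scan stops as soon as the region is no
   * longer open, because the store's "active" answer cannot be trusted then.
   */
  static List<String> selectDeletable(Region region, List<String> listed,
      boolean checkRegionOpen) {
    List<String> deletable = new ArrayList<>();
    for (String file : listed) {
      if (checkRegionOpen && !region.isOpen()) {
        break;
      }
      if (!region.store.isActive(file)) {
        deletable.add(file);
      }
    }
    return deletable;
  }

  public static void main(String[] args) {
    // Step 1: the cleaner lists the store files while the region is online.
    List<String> listed = Arrays.asList("hfile-1", "hfile-2");

    // The region closes before step 2 runs; without the guard, both live
    // files look inactive and would be selected for deletion.
    Region unguarded = new Region(new Store("hfile-1", "hfile-2"));
    unguarded.close();
    System.out.println(selectDeletable(unguarded, listed, false)); // [hfile-1, hfile-2]

    // Same interleaving with the guard: nothing is selected.
    Region guarded = new Region(new Store("hfile-1", "hfile-2"));
    guarded.close();
    System.out.println(selectDeletable(guarded, listed, true)); // []
  }
}
```

Note that a check like this only narrows the window: the region could still
close between the isOpen() check and the actual delete, so a real fix may also
need some coordination with the region close path.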
[1]
https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/BrokenStoreFileCleaner.java#L99
[2]
https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/BrokenStoreFileCleaner.java#L133