Bryan Beaudreault created HBASE-28260:
-----------------------------------------

             Summary: Possible data loss in WAL after RegionServer crash
                 Key: HBASE-28260
                 URL: https://issues.apache.org/jira/browse/HBASE-28260
             Project: HBase
          Issue Type: Bug
            Reporter: Bryan Beaudreault


We recently had a production incident:
 # RegionServer crashes, but local DataNode lives on
 # WAL lease recovery kicks in
 # NameNode reconstructs the block during lease recovery (which results in a 
new genstamp). It chooses the replica on the local DataNode as the primary.
 # Local DataNode reconstructs the block, so NameNode registers the new 
genstamp.
 # Local DataNode and the underlying host die before the new genstamp block 
can be replicated to other DataNodes.

This leaves us with a missing block, because the new genstamp block has no 
replicas. The old replicas still remain, but are considered corrupt due to 
GENSTAMP_MISMATCH.
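
To make the sequence above concrete, here is a minimal toy model in Java. It is not HDFS code: the class, method, and host names are all hypothetical, and it only mimics the genstamp bookkeeping described in the steps, under the assumption that a replica counts as usable only when its genstamp matches the one the NameNode has recorded.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the failure sequence; not HDFS code, every name is hypothetical.
class GenstampMismatchModel {
    // One stored copy of a block on a DataNode.
    static final class Replica {
        final String host;
        final long genstamp;
        final long length;

        Replica(String host, long genstamp, long length) {
            this.host = host;
            this.genstamp = genstamp;
            this.length = length;
        }
    }

    // Replicas whose genstamp matches what the NameNode has recorded.
    static List<Replica> usable(List<Replica> live, long expectedGenstamp) {
        List<Replica> ok = new ArrayList<>();
        for (Replica r : live) {
            if (r.genstamp == expectedGenstamp) {
                ok.add(r);
            }
        }
        return ok;
    }

    // Recovery bumps the genstamp on the primary only, then the primary's
    // host dies: only the other replicas survive, with the old genstamp.
    static List<Replica> afterRecoveryAndHostLoss(List<Replica> before, String primaryHost) {
        List<Replica> survivors = new ArrayList<>();
        for (Replica r : before) {
            if (!r.host.equals(primaryHost)) {
                survivors.add(r);
            }
        }
        return survivors;
    }

    public static void main(String[] args) {
        long oldGs = 1001L;
        long newGs = 1002L; // assigned during lease recovery
        List<Replica> before = List.of(
            new Replica("local-dn", oldGs, 4096L),
            new Replica("remote-dn-1", oldGs, 4096L),
            new Replica("remote-dn-2", oldGs, 4096L));
        List<Replica> survivors = afterRecoveryAndHostLoss(before, "local-dn");
        System.out.println("surviving replicas: " + survivors.size());
        System.out.println("usable at new genstamp: " + usable(survivors, newGs).size());
    }
}
```

In this model two replicas survive the host loss, but none of them is usable at the new genstamp, which corresponds to the missing block we saw.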

Thankfully we were able to confirm that the lengths of the corrupt blocks were 
identical to that of the newly constructed and lost block. Further, the file in 
question was only 1 block. So we downloaded one of those corrupt block files 
and used {{hdfs dfs -put -f}} to force that block to replace the file in HDFS. 
So in this case we had no actual data loss, but it could easily have happened 
if the file had been more than 1 block or the replicas hadn't been fully in 
sync prior to reconstruction.
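
The two conditions that made this manual recovery possible can be written down as a small pre-check. This is only a sketch of the reasoning above, not an HDFS or HBase API: the class and method names are hypothetical, and matching lengths are treated as a necessary condition, not as proof that the contents are identical.

```java
import java.util.List;

// Hypothetical pre-check mirroring the reasoning above; not an HDFS or HBase API.
class StaleReplicaRestoreCheck {
    // True only when the file has exactly one block, at least one stale
    // replica exists, and every stale replica has the same length as the
    // lost, recovered block.
    static boolean looksSafeToRestore(int blocksInFile, long recoveredLength,
                                      List<Long> staleLengths) {
        if (blocksInFile != 1 || staleLengths.isEmpty()) {
            return false;
        }
        for (long len : staleLengths) {
            if (len != recoveredLength) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Our case: one block, all stale replicas as long as the lost block.
        System.out.println(looksSafeToRestore(1, 4096L, List.of(4096L, 4096L)));
        // Multi-block file: a single block file cannot stand in for the whole file.
        System.out.println(looksSafeToRestore(2, 4096L, List.of(4096L, 4096L)));
        // A replica that was behind at the time of the crash.
        System.out.println(looksSafeToRestore(1, 4096L, List.of(4096L, 2048L)));
    }
}
```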

In order to avoid this issue, we should avoid writing WAL blocks to the local 
DataNode. We can use CreateFlag.NO_LOCAL_WRITE for this. Hat tip to [~weichiu] 
for pointing this out.
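
The intended effect on replica placement can be sketched with a toy model. This is not the HDFS block placement policy and it does not call the Hadoop API: the class, the {{avoidLocal}} parameter, and the host names are hypothetical stand-ins, and we assume the real flag acts as a hint to the NameNode that may not be honored in every case.

```java
import java.util.ArrayList;
import java.util.List;

// Toy placement model; hypothetical names, not the HDFS block placement policy.
class NoLocalWritePlacementModel {
    // Pick up to `replication` targets in list order. When avoidLocal is set,
    // skip the writer's own host so no replica lands on the local DataNode.
    static List<String> chooseTargets(List<String> liveNodes, String writerHost,
                                      int replication, boolean avoidLocal) {
        List<String> targets = new ArrayList<>();
        for (String node : liveNodes) {
            if (targets.size() == replication) {
                break;
            }
            if (avoidLocal && node.equals(writerHost)) {
                continue;
            }
            targets.add(node);
        }
        return targets;
    }

    public static void main(String[] args) {
        // The writer's host is listed first to mimic a local-first default.
        List<String> nodes = List.of("rs-host", "dn-a", "dn-b", "dn-c");
        System.out.println(chooseTargets(nodes, "rs-host", 3, false));
        System.out.println(chooseTargets(nodes, "rs-host", 3, true));
    }
}
```

With the local host excluded, losing the RegionServer's host no longer removes a replica, so it cannot take the only replica carrying the new genstamp with it.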

When reading WALs we already reorder blocks so as to avoid reading from 
the local DataNode, but avoiding writing there altogether would be better.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
