Benoit Sigoure created HBASE-29108:
--------------------------------------

             Summary: regionserver does not cleanup storefiles written to .tmp 
directory when failing to close the storefiles during compaction
                 Key: HBASE-29108
                 URL: https://issues.apache.org/jira/browse/HBASE-29108
             Project: HBase
          Issue Type: Bug
          Components: regionserver
    Affects Versions: 2.5.10
            Reporter: Benoit Sigoure


Background:
 * When HBase performs a compaction, it writes the compaction result (one or 
more storefiles) to HDFS under 
{{/hbase/data/<namespace>/<table>/<region>/.tmp/<columnfamily>/<storefile>}}
 * Once the compaction succeeds, the storefile is renamed to 
{{/hbase/data/<namespace>/<table>/<region>/<columnfamily>/<storefile>}} (moved 
out of the {{.tmp}} directory to where storefiles are stored and read to serve 
client RPCs)
 * When compaction fails, in some cases cleanup is performed and the storefile 
under the {{.tmp}} directory is deleted. However, in other cases the storefile 
is left under the {{.tmp}} directory (e.g. when one of the datanodes to which 
the storefile's last block was being written gets {{SIGKILL}}'ed)
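A leftover compaction output can be spotted by listing the region's {{.tmp}} directory directly in HDFS. A minimal sketch; the namespace, table, and encoded region name below are placeholders to substitute for the region being inspected:

```shell
# Placeholders: substitute the namespace, table, and encoded region name
# of the region you are inspecting.
NS='default'
TABLE='mytable'
REGION='<encoded-region-name>'

# Recursively list any leftover compaction outputs under the region's
# .tmp directory; a healthy region has an empty (or absent) .tmp.
hdfs dfs -ls -R "/hbase/data/$NS/$TABLE/$REGION/.tmp"
```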

Problem:
 * In certain cases, a storefile under {{.tmp}} will contain one corrupt block 
replica and one good block replica (e.g. one replica is corrupt with reason 
{{GENSTAMP_MISMATCH}}, meaning its generation stamp differs, and/or has a file 
length lower than the good replica's). The namenode detects this block 
corruption and reports it in its metrics
 * The corrupt blocks will remain corrupt and the good block replica will not 
be re-replicated to other datanodes to fix the corruption.
 * The storefile under {{.tmp}} remains "open" (under construction) from the 
regionserver's point of view.
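One way to confirm the state described above (a file still open for write whose block has a corrupt replica) is {{hdfs fsck}}. A sketch, with the path components as placeholders:

```shell
# Placeholders for the region whose .tmp directory is being checked.
NS='default'
TABLE='mytable'
REGION='<encoded-region-name>'

# -openforwrite makes fsck include under-construction files that it
# would otherwise skip; -files/-blocks/-locations print each block's
# replica placement, so a corrupt-vs-good replica pair is visible.
hdfs fsck "/hbase/data/$NS/$TABLE/$REGION/.tmp" \
    -files -blocks -locations -openforwrite
```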

Impact:
 * *No* visible impact on hbase clients (storefiles under {{.tmp}} are not read 
to return data to clients).
 * This can trip up alerts & monitoring (the namenode reports corrupt blocks 
that do not fix themselves until the affected regions are reopened or the 
regionservers restart)
 * Decommissioning of datanodes can get blocked indefinitely (a block that 
contains a corrupt replica but is part of a file that is still open does not 
get re-replicated to other datanodes even if a good replica is available, thus 
the datanode that has the only good replica of a block cannot be decommissioned)

Workaround:
 * A region can be re-opened (e.g. by restarting a regionserver on which the 
region is open), which causes the region's {{.tmp}} directory to be deleted 
recursively once the region is opened again, removing all corrupt blocks and 
leftover storefiles.
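Re-opening the region without a full regionserver restart can also be done from the HBase shell by moving it: the move closes the region on its current server and opens it on another, and the open deletes the region's {{.tmp}} directory recursively. A sketch of this workaround, with the encoded region name as a placeholder:

```shell
# '<encoded-region-name>' is a placeholder. With no destination server
# given, the master picks a target regionserver itself. Reopening the
# region cleans up its .tmp directory (and the corrupt blocks with it).
echo "move '<encoded-region-name>'" | hbase shell -n
```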

 

This bug report was written by Tomas Baltrunas at Arista.
