Benoit Sigoure created HBASE-29108:
--------------------------------------
Summary: regionserver does not cleanup storefiles written to .tmp
directory when failing to close the storefiles during compaction
Key: HBASE-29108
URL: https://issues.apache.org/jira/browse/HBASE-29108
Project: HBase
Issue Type: Bug
Components: regionserver
Affects Versions: 2.5.10
Reporter: Benoit Sigoure
Background:
* When hbase performs a compaction, it writes the compaction output (one or
more storefiles) to files in HDFS under
{{/hbase/data/<namespace>/<table>/<region>/{*}.tmp{*}/<columnfamily>/<storefile>}}
* Once the compaction succeeds, each storefile is renamed to
{{/hbase/data/<namespace>/<table>/<region>/<columnfamily>/<storefile>}} (moved
out of the {{.tmp}} directory to the location where storefiles are stored and
read to serve client RPCs)
* When a compaction fails, in some cases cleanup is performed and the
storefile under the {{.tmp}} directory is deleted. In other cases, however,
the storefile is left under the {{.tmp}} directory (e.g. when one of the
datanodes to which the storefile's last block was being written gets
{{SIGKILL}}'ed)
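Leftover compaction outputs can be spotted directly in HDFS. A sketch, assuming
the default filesystem layout; "default" and "mytable" are placeholders for the
real namespace and table name:

```shell
# List everything left under the per-region .tmp directories of a table.
# The FsShell ls command expands the glob across all regions of the table.
hdfs dfs -ls -R '/hbase/data/default/mytable/*/.tmp'
```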
Problem:
* In certain cases, a storefile under {{.tmp}} will contain one corrupt block
replica and one good block replica (e.g. a replica may be reported corrupt
with reason {{GENSTAMP_MISMATCH}} because its generation stamp differs, and/or
may have a shorter length than the good replica's). The namenode detects this
block corruption and reports it in its metrics
* The corrupt replicas remain corrupt, and the good block replica is not
re-replicated to other datanodes to repair the corruption.
* The storefile under {{.tmp}} remains "open" (under construction) from the
regionserver's perspective.
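One way to confirm this state is HDFS fsck, which can report files that are
still open for write together with their per-block replica locations (the path
below is a placeholder for the affected table's directory):

```shell
# -openforwrite includes files still under construction, which fsck
# otherwise skips; -files -blocks -locations prints per-block replica info,
# making it possible to match corrupt replicas to still-open .tmp storefiles.
hdfs fsck /hbase/data/default/mytable -openforwrite -files -blocks -locations
```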
Impact:
* *No* visible impact on hbase clients (storefiles under {{.tmp}} are not read
to return data to clients).
* This can trip up alerts & monitoring (the namenode reports corrupt blocks
that never fix themselves until the affected regions are reopened or the
regionservers are restarted)
* Decommissioning of datanodes can block indefinitely (a block with a corrupt
replica that belongs to a still-open file is not re-replicated to other
datanodes even when a good replica is available, so the datanode holding the
only good replica of such a block cannot be decommissioned)
Workaround:
* A region can be re-opened (e.g. by restarting a regionserver on which the
region is open), which causes the region's {{.tmp}} directory to be deleted
recursively once the region is opened again, removing all corrupt blocks and
leftover storefiles.
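A sketch of forcing the reopen from the HBase shell, using a hypothetical
placeholder for the encoded region name (a rolling regionserver restart
achieves the same):

```shell
# unassign/assign closes and reopens the region; on open, the region's
# .tmp directory is deleted recursively, clearing the leftover storefiles
# and their corrupt block replicas.
echo "unassign 'ENCODED_REGION_NAME'" | hbase shell -n
echo "assign 'ENCODED_REGION_NAME'" | hbase shell -n
```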
This bug report was written by Tomas Baltrunas at Arista.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)