chaijunjie created HBASE-29660:
----------------------------------
Summary: submittedRegionProcedures data leak in HRegionServer when
region open failed.
Key: HBASE-29660
URL: https://issues.apache.org/jira/browse/HBASE-29660
Project: HBase
Issue Type: Bug
Components: Region Assignment
Reporter: chaijunjie
HRegionServer uses two structures (a map and a cache) to track region procedures:
https://github.com/apache/hbase/blob/e57552521e2c228e8ec8d09ebef4657ba8a2b1d9/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java#L268C3-L268C97
and
https://github.com/apache/hbase/blob/e57552521e2c228e8ec8d09ebef4657ba8a2b1d9/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java#L273
When the RS is asked to submit a region procedure that is already tracked, it
ignores the request:
https://github.com/apache/hbase/blob/e57552521e2c228e8ec8d09ebef4657ba8a2b1d9/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java#L3596C1-L3596C47
However, executedRegionProcedures is a cache object that cleans itself up, while
submittedRegionProcedures is a map that does not. When a region open fails on
the RS, the handler just returns, see
https://github.com/apache/hbase/blob/e57552521e2c228e8ec8d09ebef4657ba8a2b1d9/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/handler/AssignRegionHandler.java#L146
As a result, rs.finishRegionProcedure is never called on this RS for this
region, and the entry stays in submittedRegionProcedures.
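
The sequence above can be illustrated with a simplified, self-contained model.
This is only a sketch, not the HBase source: the class TrackerModel and its
openRegion method are hypothetical, the executed cache is modelled as a plain
set without expiry, and the body of finishRegionProcedure (record the id as
executed, drop it from the submitted map) is an assumption.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class TrackerModel {
  // Stand-in for submittedRegionProcedures: entries are only removed explicitly.
  final ConcurrentMap<Long, Long> submitted = new ConcurrentHashMap<>();
  // Stand-in for executedRegionProcedures (modelled here without expiry).
  final Set<Long> executed = ConcurrentHashMap.newKeySet();

  // Returns false when the procedure id is already tracked, i.e. the request is ignored.
  boolean submitRegionProcedure(long procId) {
    if (executed.contains(procId)) {
      return false;
    }
    return submitted.putIfAbsent(procId, procId) == null;
  }

  // Assumed behaviour: mark the id as executed and stop tracking it as submitted.
  void finishRegionProcedure(long procId) {
    executed.add(procId);
    submitted.remove(procId);
  }

  // Hypothetical open handler: on failure it returns without finishing the procedure.
  boolean openRegion(long procId, boolean openSucceeds) {
    if (!submitRegionProcedure(procId)) {
      return false; // ignored as a duplicate
    }
    if (!openSucceeds) {
      return false; // failure path: the entry stays in 'submitted'
    }
    finishRegionProcedure(procId);
    return true;
  }

  public static void main(String[] args) {
    TrackerModel rs = new TrackerModel();
    rs.openRegion(42L, false);                           // the open fails
    System.out.println(rs.submitted.containsKey(42L));   // prints true: entry leaked
    System.out.println(rs.submitRegionProcedure(42L));   // prints false: same id ignored
  }
}
```

In this model the only way to get the leaked id accepted again is to discard the
whole TrackerModel instance, which corresponds to restarting the RegionServer.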
Sometimes the "MasterData" dir loses data in HDFS. I tried to fix it by
recreating the master region (restarting HMaster), but after that some regions
could not be opened (triggered by balancer/SCP). We just found these logs, and
then we had to restart many RegionServers.
I think we could call finishRegionProcedure in the cleanUpAndReportFailure
method after the failure has been reported to the master successfully. We could
also make submittedRegionProcedures a cache object instead of a map to avoid
this leak.
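
A minimal sketch of both ideas, under stated assumptions: the class
SubmittedProcedureFixSketch, the reportToMaster callback, the injected clock and
the TTL value are all hypothetical, and the expiry is hand-rolled on a
ConcurrentHashMap only to keep the example self-contained. It is not a patch
against HRegionServer or AssignRegionHandler.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.LongPredicate;
import java.util.function.LongSupplier;

public class SubmittedProcedureFixSketch {
  // procId -> submission time in milliseconds
  private final ConcurrentMap<Long, Long> submitted = new ConcurrentHashMap<>();
  private final long ttlMillis;
  private final LongSupplier clock;

  SubmittedProcedureFixSketch(long ttlMillis, LongSupplier clock) {
    this.ttlMillis = ttlMillis;
    this.clock = clock;
  }

  boolean submitRegionProcedure(long procId) {
    long now = clock.getAsLong();
    // Idea 2: drop entries older than the TTL so a missed finish cannot pin them forever.
    submitted.values().removeIf(submittedAt -> now - submittedAt >= ttlMillis);
    return submitted.putIfAbsent(procId, now) == null;
  }

  void finishRegionProcedure(long procId) {
    submitted.remove(procId);
  }

  // Idea 1: finish the procedure once the failure has been reported successfully.
  boolean cleanUpAndReportFailure(long procId, LongPredicate reportToMaster) {
    boolean reported = reportToMaster.test(procId);
    if (reported) {
      finishRegionProcedure(procId);
    }
    return reported;
  }

  boolean isTracked(long procId) {
    return submitted.containsKey(procId);
  }

  public static void main(String[] args) {
    long[] now = {0L};
    SubmittedProcedureFixSketch rs = new SubmittedProcedureFixSketch(1000L, () -> now[0]);

    rs.submitRegionProcedure(1L);
    rs.cleanUpAndReportFailure(1L, pid -> true);       // report succeeds
    System.out.println(rs.isTracked(1L));              // prints false: entry released

    rs.submitRegionProcedure(2L);                      // never finished
    now[0] = 1000L;                                    // TTL elapses
    System.out.println(rs.submitRegionProcedure(2L));  // prints true: stale entry expired
  }
}
```

One trade-off of the expiring variant is that a procedure running longer than
the TTL could be accepted a second time, so the timeout would need to be chosen
with that in mind.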
--
This message was sent by Atlassian Jira
(v8.20.10#820010)