thangTang commented on a change in pull request #1071: HBASE-23693 Split failure may cause region hole and data loss when use zk assign URL: https://github.com/apache/hbase/pull/1071#discussion_r368797626
########## File path: hbase-server/src/main/java/org/apache/hadoop/hbase/master/RegionStates.java ########## @@ -783,10 +801,20 @@ public void regionOffline( LOG.info("Found region in " + state + " to be reassigned by ServerCrashProcedure for " + sn); rits.add(hri); - } else if(state.isSplittingNew() || state.isMergingNew()) { - LOG.info("Offline/Cleanup region if no meta entry exists, hri: " + hri + - " state: " + state); - regionsToClean.add(state.getRegion()); + } else if (state.isSplittingNew() || state.isMergingNew()) { + LOG.info( + "Offline/Cleanup region if no meta entry exists, hri: " + hri + " state: " + state); + if (daughter2Parent.containsKey(hri.getEncodedName())) { + HRegionInfo parent = daughter2Parent.get(hri.getEncodedName()); + HRegionInfo info = getHRIFromMeta(parent); + if (info != null && info.isSplit() && info.isOffline()) { + regionsToClean.add(Pair.newPair(state.getRegion(), info)); Review comment: > > If the RS machine crash when the SplitTransactionImpl step after PONR, master will handle the split rollback. > > Yeah, but it seems to work only because Master has different info for the region than what is actual in meta. What if Active Master also crashes before it triggers SCP and daughters get cleaned up? The other master will load regions state as seen in meta, then it can run into same problem again. daughter cleanup is excuted in SCP, and in this patch it will do three things together: 1. delete daughter region in meta 2. update parent region in meta 3. delete daughter region dir in HDFS ``` if (regionPair != null) { MetaTableAccessor.deleteRegion(this.server.getConnection(), hri); MetaTableAccessor.deleteRegion(this.server.getConnection(), hri); } } if (parentInfo != null) { List<Mutation> mutations = new ArrayList<Mutation>(); HRegionInfo copyOfParent = new HRegionInfo(parentInfo); copyOfParent.setOffline(false); copyOfParent.setSplit(false); Put putParent = MetaTableAccessor.makePutFromRegionInfo(copyOfParent); mutations.add(putParent); MetaTableAccessor.mutateMetaTable(this.server.getConnection(), mutations); } LOG.debug("Cleaning up HDFS since no meta entry exists, hri: " + hri); LOG.debug("Cleaning up HDFS since no meta entry exists, hri: " + hri); FSUtils.deleteRegionDir(server.getConfiguration(), hri); FSUtils.deleteRegionDir(server.getConfiguration(), hri); ``` So if Active Master also crashes before it triggers SCP, the daughter won`t be deleted. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services