[ https://issues.apache.org/jira/browse/HBASE-20878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16542395#comment-16542395 ]
Allan Yang commented on HBASE-20878:
------------------------------------

Uploaded a new patch with the testing code deleted.

I think checking the server is OK. It acts like a safety fence: if we close the regions and the RS that hosted them is still online, we can safely proceed to the next state. The chance that the RS crashes just before we do this check and the region close is very, very small, and if it happens, aborting the merge is still an acceptable outcome.

On the other hand, checking the region itself, for example checking for recovered.edits, is a bit of a hack, I think. What if we bring DLR (distributed log replay) back and there are no recovered.edits? Still, this is open to discussion.

Another question: should we just abort the merge, or should we retry, as you suggested above, [~Apache9]?

> Data loss if merging regions while ServerCrashProcedure executing
> -----------------------------------------------------------------
>
>                 Key: HBASE-20878
>                 URL: https://issues.apache.org/jira/browse/HBASE-20878
>             Project: HBase
>          Issue Type: Bug
>          Components: amv2
>    Affects Versions: 3.0.0, 2.1.0, 2.0.1
>            Reporter: Allan Yang
>            Assignee: Allan Yang
>            Priority: Critical
>             Fix For: 3.0.0, 2.0.2, 2.1.1
>
>         Attachments: HBASE-20878.branch-2.0.001.patch,
> HBASE-20878.branch-2.0.002.patch
>
>
> In MergeTableRegionsProcedure, we close the regions to merge using
> UnassignProcedure. But if the RS hosting these regions crashes, a
> ServerCrashProcedure will execute at the same time. The UnassignProcedures
> will be blocked until all logs are split. But since these regions are closed
> for merging, they won't be opened again, so the recovered.edits in the region
> dir won't be replayed, and data will be lost.
> I provided a test to reproduce this case. I strongly suspect the split region
> procedure also has this kind of problem. I will check later.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)