[ https://issues.apache.org/jira/browse/HDFS-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160601#comment-13160601 ]
Todd Lipcon commented on HDFS-1972:
-----------------------------------

Here are some thoughts on this JIRA, hopefully organized in a way that's conducive to discussion:

h2. Point 0: We should always err on the side of conservatism -- better to introduce inefficiency than incorrectness.

h2. Point 1: *Of the commands that come from the NN to the DN, the only one that's really dangerous is DNA_INVALIDATE*:
- DNA_REGISTER is scoped to a single NN.
- DNA_TRANSFER, if done mistakenly, can only result in an extra replica, which would be cleaned up later.
- DNA_RECOVERBLOCK should be idempotent (though we should investigate this separately).
- DNA_ACCESSKEYUPDATE is logged to the edit log, so the new NN should have the same keys (also need to check on this).

DNA_INVALIDATE, however, could cause data loss if both NNs try to deal with an over-replicated block but make different decisions about which replicas to invalidate.

h2. Point 2: *Because block invalidation is reported asynchronously, the new active NN during a failover will not be aware of pending deletions.*

For example:
- A block has 2 replicas; the user asks to reduce to 1.
- NN1 asks DN1 to delete its replica.
- DN1 receives the command and deletes the replica, but does not send a deletion report yet.
- Failover NN1->NN2 occurs. Both DN1 and DN2 properly acknowledge the new active.
- NN2 doesn't know that DN1 has deleted the block, since it hasn't received a deletion report yet.
- NN2 asks DN2 to delete it. DN2 complies because NN2 is active.
- Now we have no replicas :(

So, even with a completely consistent view of which NN is active, we have an issue. The solution isn't just to make a consistent switchover on each DN -- *we need some stronger guarantee that prevents deletions during the period when the new NN's cluster view is out of date*.

h2. Proposed solution to the above:

After a failover, the active NN should not issue any deletions until all pending deletions from prior NNs have been processed.
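To make the NN-side half of this concrete, here's a minimal sketch (hypothetical class and method names -- this is not actual HDFS code) of how a newly active NN could gate invalidation commands until every live DN has both acknowledged it and sent its post-failover deletion report:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: per-datanode bookkeeping on the newly active NN.
// Invalidations stay blocked until all known live DNs have (a) promised
// not to accept commands from a previous NN and (b) sent a deletion
// report covering all pending or processed deletions.
public class PostFailoverInvalidationGate {
    private static final class DnState {
        boolean promised;
        boolean deletionReportReceived;
    }

    // Keyed by some datanode identifier (e.g. a storage ID); hypothetical.
    private final Map<String, DnState> datanodes = new HashMap<>();

    public synchronized void registerDatanode(String dnId) {
        datanodes.putIfAbsent(dnId, new DnState());
    }

    // The DN has promised not to accept commands from a previous NN.
    public synchronized void recordPromise(String dnId) {
        registerDatanode(dnId);
        datanodes.get(dnId).promised = true;
    }

    // The DN has sent its post-failover deletion report.
    public synchronized void recordDeletionReport(String dnId) {
        registerDatanode(dnId);
        datanodes.get(dnId).deletionReportReceived = true;
    }

    // A DN declared dead stops blocking the gate; the NN removes its
    // replicas from the block map at the same time (see Scenario 2).
    public synchronized void markDead(String dnId) {
        datanodes.remove(dnId);
    }

    // Deletions may be issued only once every known live DN has both
    // promised and reported.
    public synchronized boolean mayIssueInvalidations() {
        for (DnState s : datanodes.values()) {
            if (!s.promised || !s.deletionReportReceived) {
                return false;
            }
        }
        return true;
    }
}
```

Note that a DN timing out unblocks the gate only because the NN also drops that DN's replicas from the block map, so no block looks over-replicated on account of a dead node.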
So the fencing happens at two points:

*In the DN*: Upon deciding that some NN is newly active, it issues a promise to that NN that it will not accept any further commands from a previous NN. [1]

*In the NN*: Upon becoming active, it must not issue any deletions until:
a) every DN has issued the promise described above, and
b) every DN has also issued a block deletion report [2] which includes all pending or processed deletions.

[1] This promise may have to be recorded in the DN's on-disk state -- see Scenario 3 below.
[2] If we want to be really paranoid, we could wait for a full block report.

h2. A few example scenarios:

h3. Scenario 1: standard failover
- A block has 2 replicas; the user asks to reduce to 1.
- NN1 asks DN1 to delete it.
- DN1 receives the command and deletes the replica, but does not send a deletion report yet.
- Failover NN1->NN2 occurs. Both DN1 and DN2 properly acknowledge NN2 as active _and promise not to accept commands from NN1_.
- _NN1 might try to continue issuing commands, but they will be ignored._
- NN2 doesn't know that DN1 has deleted the block, since it hasn't sent a deletion report yet.
- NN2 considers the block over-replicated, _but does not send an INVALIDATE command, because it hasn't gotten a deletion report from all DNs_.
- DN1 sends its deletion report. DN2 sends an empty deletion report.
- _Since NN2 has now received a deletion report from all DNs, it knows there's only one replica, and doesn't ask for the deletion._

h3. Scenario 2: cluster partition (NN1 and DN1 on one side of the cluster, NN2 and DN2 on the other)
- A block has 2 replicas; the user asks to reduce to 1.
- NN1 asks DN1 to delete it.
- DN1 receives the command and deletes the replica, but does not send a deletion report yet.
- A network partition happens. NN1 thinks it's still active for some period of time, but NN2 is the one that actually has access to the edit logs.
- DN1 deletes the block and reports it to NN1.
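The DN-side promise above can be sketched as a small guard that tracks the most recent NN it has acknowledged (here identified by a monotonically increasing number such as an edit-log txid or epoch -- a hypothetical illustration, not actual HDFS code) and rejects commands from anything older:

```java
// Hypothetical sketch: a datanode-side guard enforcing the promise
// "I will not accept any further commands from a previous NN."
// The NN is identified by a monotonically increasing number (e.g. a
// txid or epoch); higher means more recently active.
public class NamenodeCommandGuard {
    private long promisedEpoch = Long.MIN_VALUE;

    // Called when the DN acknowledges a newly active NN. The promise is
    // monotonic: acknowledging a stale NN can never lower the bar.
    public synchronized void promiseTo(long nnEpoch) {
        if (nnEpoch > promisedEpoch) {
            promisedEpoch = nnEpoch;
        }
    }

    // Commands (e.g. DNA_INVALIDATE) are processed only if they come
    // from the promised NN or a newer one; older NNs are ignored.
    public synchronized boolean mayProcessCommand(long nnEpoch) {
        return nnEpoch >= promisedEpoch;
    }
}
```

As written this guard lives only in memory, which is exactly the weakness Scenario 3 below exposes.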
It tries to report to NN2 but can't connect anymore due to the partition.
- _DN2 promises to NN2 that it won't accept commands from NN1. It doesn't talk to NN1 because of the partition._
- NN2 still considers DN1 and DN2 alive because they haven't timed out yet, so it considers the block over-replicated.
- _NN2 doesn't send deletion commands because it hasn't gotten a deletion report from all DNs since the failover event._
- Eventually NN2 considers DN1 dead. At that point, DN1's replica is removed from the block map, so the block is no longer considered over-replicated.
- Thus, NN2 doesn't delete the good replica on DN2.

h3. Scenario 3: DN restarts during a split-brain period (this scenario illustrates why I think we need to persistently record the promise about who is active)
- A block has 2 replicas; the user asks to reduce to 1.
- NN1 adds the block to DN1's invalidation queue, but it's backed up behind a bunch of other commands, so the deletion isn't issued yet.
- Failover occurs, but NN1 still thinks it's active.
- DN1 promises to NN2 not to accept commands from NN1. It sends an empty deletion report to NN2. Then it crashes.
- NN2 has received a deletion report from everyone, and asks DN2 to delete the block. It hasn't realized that DN1 crashed yet.
- DN2 deletes the block.
- DN1 starts back up. When it comes back, it talks to NN1 first (maybe it takes a while to connect to NN2 for some reason).
-- If we had saved the "promise" as part of persistent state, we could ignore NN1 and avoid this issue. Otherwise:
-- NN1 still thinks it's active and sends a command to DN1 to delete the block. DN1 does so.
-- We lost the block :(

Persisting the txid on the DN's disks actually has another nice property for non-HA clusters: if you accidentally restart the NN from an old snapshot of the filesystem state, the DNs can refuse to connect, or refuse to process deletions. Currently, in this situation, the DNs would connect and then delete all of the newer blocks.
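A sketch of what persisting that promise might look like on the DN side (hypothetical file name and class, not actual HDFS code -- it assumes each NN can be identified by the txid its edit log has reached):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical sketch: the DN records the highest txid it has promised
// to, in a file under its storage directory, so the promise survives a
// restart (Scenario 3). The same check also lets a DN refuse an NN
// restarted from an old snapshot in a non-HA cluster.
public class PersistedPromise {
    private final Path file;

    public PersistedPromise(Path storageDir) {
        this.file = storageDir.resolve("last_promised_txid");
    }

    // Highest txid promised so far; -1 if none has been recorded.
    public long load() throws IOException {
        if (!Files.exists(file)) {
            return -1L;
        }
        return Long.parseLong(Files.readString(file).trim());
    }

    // Record a promise to an NN whose edit log has reached nnTxid.
    // Monotonic: a stale NN can never lower the recorded value.
    public void record(long nnTxid) throws IOException {
        if (nnTxid <= load()) {
            return;
        }
        // Write to a temp file, then rename, for crash safety.
        Path tmp = file.resolveSibling(file.getFileName() + ".tmp");
        Files.writeString(tmp, Long.toString(nnTxid));
        Files.move(tmp, file, StandardCopyOption.ATOMIC_MOVE);
    }

    // On (re)connect: refuse any NN whose txid is behind the promise,
    // whether a split-brain NN1 or an NN rolled back to an old snapshot.
    public boolean acceptNamenode(long nnTxid) throws IOException {
        return nnTxid >= load();
    }
}
```

With this in place, the restarted DN1 in Scenario 3 reloads the promise from disk before talking to anyone, so NN1's stale delete command is rejected.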
> HA: Datanode fencing mechanism
> ------------------------------
>
>                 Key: HDFS-1972
>                 URL: https://issues.apache.org/jira/browse/HDFS-1972
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: data-node, name-node
>            Reporter: Suresh Srinivas
>            Assignee: Todd Lipcon
>
> In a high availability setup, with an active and a standby namenode, there is a possibility of two namenodes sending commands to the datanode. The datanode must honor commands from only the active namenode and reject commands from the standby, to prevent corruption. This invariant must hold during failover and in other states such as split brain. This jira covers the issues involved, the design of the solution, and its implementation.