[ https://issues.apache.org/jira/browse/HDFS-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160601#comment-13160601 ]

Todd Lipcon commented on HDFS-1972:
-----------------------------------

Here are some thoughts on this JIRA, hopefully organized in a way that's 
conducive to discussion:

h2. Point 0:
We should always err on the side of conservatism -- we would rather introduce 
inefficiency than incorrectness.


h2. Point 1:

*Of the commands that come from NN to DN, the only one that's really dangerous 
is DNA_INVALIDATE*:
- DNA_REGISTER is scoped to a single NN
- DNA_TRANSFER, if done mistakenly, can only result in an extra replica, which 
would be cleaned up later
- DNA_RECOVERBLOCK - should be idempotent (though we should investigate this 
separately)
- DNA_ACCESSKEYUPDATE - this is logged to the edit log, so the new NN should 
also have the same keys (also need to check on this)

DNA_INVALIDATE, however, could cause data loss if both NNs are trying to deal 
with an over-replicated block but make different decisions about which replicas 
to invalidate (see the sketch below).
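
To make Point 1 concrete, here's a minimal, hypothetical sketch (the class and 
field names are mine, not actual DataNode code) of a DN-side filter that lets 
the benign commands through unconditionally and only gates DNA_INVALIDATE on 
which NN the DN currently considers active:

{code:java}
// Hypothetical sketch, not real DataNode code: only DNA_INVALIDATE is gated on
// the "active NN" check, since the other commands are benign or idempotent even
// if a stale NN issues them.
enum CommandType {
  DNA_REGISTER, DNA_TRANSFER, DNA_INVALIDATE, DNA_RECOVERBLOCK, DNA_ACCESSKEYUPDATE
}

class CommandFilter {
  // Assumption: the DN tracks which NN it currently considers active.
  private volatile String activeNamenodeId;

  void setActiveNamenode(String nnId) {
    this.activeNamenodeId = nnId;
  }

  boolean shouldProcess(CommandType type, String sendingNamenodeId) {
    if (type != CommandType.DNA_INVALIDATE) {
      // Extra replicas, re-registration, etc. are safe to re-do.
      return true;
    }
    // Only delete replicas on behalf of the NN we currently believe is active.
    return sendingNamenodeId.equals(activeNamenodeId);
  }
}
{code}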


h2. Point 2:

*Because block invalidation is reported asynchronously, the new active NN during 
a failover will not be aware of pending deletions.* For example:

- block has 2 replicas, user asks to reduce to 1
- NN1 asks DN1 to delete it
- DN1 receives the command, deletes the replica, but does not do a deletion 
report yet
- Failover NN1->NN2 occurs. Both DN1 and DN2 properly acknowledge the new 
active.
- NN2 doesn't know that DN1 has deleted the block, since it hasn't received a 
deletion report yet
- NN2 asks DN2 to delete it. DN2 complies because NN2 is active.
- Now we have no replicas :(

So even with a completely consistent view of which NN is active, we have an 
issue. The solution isn't just to make a consistent switchover on each DN - *we 
need some stronger guarantee that prevents deletions during the period when the 
new NN's cluster view is out of date*.


h2. Proposed solution to the above:

After a failover, the active NN should not issue any deletions until all 
pending deletions from prior NNs have been processed. So the fencing happens at 
two points (a rough sketch follows the notes below):

*In DN*: Upon deciding that some NN is newly active, the DN issues a promise to 
that NN that it will not accept any further commands from a previous NN. [1]
*In NN*: Upon becoming active, the NN must not issue any deletions until:
a) Every DN has issued the promise described above
b) Every DN has also issued a block deletion report[2] which includes all 
pending or processed deletions.

[1] This promise may have to be recorded in the DN's on-disk state - see 
Scenario 3 below.
[2] If we want to be really paranoid, we could wait for a full block report.
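
Here's a minimal sketch of the NN-side gate, with hypothetical names 
(FailoverFencingGate is not an existing class): after a failover, the new 
active NN holds back invalidation work until every live DN has both 
acknowledged it (the promise) and sent a deletion report.

{code:java}
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the NN-side fencing gate described above.
class FailoverFencingGate {
  private final Set<String> liveDatanodes;                       // DNs currently considered alive
  private final Set<String> promised = new HashSet<>();          // DNs that acknowledged this NN
  private final Set<String> reportedDeletions = new HashSet<>(); // DNs that sent a post-failover deletion report

  FailoverFencingGate(Set<String> liveDatanodes) {
    this.liveDatanodes = liveDatanodes;
  }

  void onPromise(String dn)        { promised.add(dn); }
  void onDeletionReport(String dn) { reportedDeletions.add(dn); }

  void onDatanodeDeclaredDead(String dn) {
    // A dead DN no longer blocks the gate; its replicas are also removed from
    // the block map, as in Scenario 2 below.
    liveDatanodes.remove(dn);
  }

  /** Invalidations may only be issued once both conditions hold for every live DN. */
  boolean mayIssueInvalidations() {
    return promised.containsAll(liveDatanodes)
        && reportedDeletions.containsAll(liveDatanodes);
  }
}
{code}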


h2. A few example scenarios:

h3. Scenario 1: standard failover

- block has 2 replicas, user asks to reduce to 1
- NN1 asks DN1 to delete it
- DN1 receives the command, deletes the replica, but does not do a deletion 
report yet
- Failover NN1->NN2 occurs. Both DN1 and DN2 properly acknowledge NN2 is active 
_and promise not to accept commands from NN1_.
- _NN1 might try to continue to issue commands but they will be ignored_.

- NN2 doesn't know that DN1 has deleted the block, since DN1 hasn't sent a 
deletion report yet.
- NN2 considers the block over-replicated, _but does not send an INVALIDATE 
command, because it hasn't gotten a deletion report from all DNs_.
- DN1 sends its deletion report. DN2 sends a deletion report which is empty.
- _Since NN2 has now received a deletion report from all DNs, it knows there's 
only one replica, and doesn't ask for the deletion_.


h3. Scenario 2: cluster partition (NN1 and DN1 on one side of the cluster, NN2 
and DN2 on the other)

- block has 2 replicas, user asks to reduce to 1
- NN1 asks DN1 to delete it
- DN1 receives the command, deletes the replica, but does not do a deletion 
report yet
- Network partition happens. NN1 thinks it's still active for some period of 
time, but NN2 is the one that actually has access to the edit logs.
- DN1 reports the deletion to NN1. It tries to report to NN2 but it can't 
connect anymore due to the partition.

- _DN2 promises to NN2 that it won't accept commands from NN1. It doesn't talk 
to NN1 because it's partitioned_.
- NN2 still considers DN1 and DN2 as alive because they haven't timed out yet. 
Thus it considers the block to be over-replicated.
- _NN2 doesn't send deletion commands because it hasn't gotten a deletion 
report from all DNs after the failover event_.
- Eventually NN2 considers DN1 dead. At this point, that replica is removed 
from the map, so the block is no longer considered over-replicated.
- Thus, NN2 doesn't delete the good replica on DN2.

h3. Scenario 3: DN restarts during the split-brain period
(this scenario illustrates why I think we need to persistently record the 
promise about who is active)

- block has 2 replicas, user asks to reduce to 1
- NN1 adds the block to DN1's invalidation queue, but it's backed up behind a 
bunch of other commands, so the invalidation doesn't get issued yet.
- Failover occurs, but NN1 still thinks it's active.
- DN1 promises to NN2 not to accept commands from NN1. It sends an empty 
deletion report to NN2. Then, it crashes.
- NN2 has received a deletion report from everyone, and asks DN2 to delete the 
block. It hasn't realized yet that DN1 has crashed.
- DN2 deletes the block.

- DN1 starts back up. When it comes back, it talks to NN1 first (maybe it 
takes a while to connect to NN2 for some reason).
-- Now, if we had saved the "promise" as part of persistent state, we could 
ignore NN1 and avoid this issue. Otherwise:
-- NN1 still thinks it's active, and sends a command to DN1 to delete the 
block. DN1 does so.
-- We've lost the block :(

Persisting the txid on the DN's disks actually has another nice property for 
non-HA clusters -- if you accidentally restart the NN from an old snapshot of 
the filesystem state, the DNs can refuse to connect, or refuse to process 
deletions. Currently, in this situation, the DNs would connect and then delete 
all of the newer blocks. A rough sketch of this follows.
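
Here's a rough sketch of what persisting the promise might look like on the DN 
side (all names and the file layout are hypothetical): the DN durably records 
the highest NN txid it has promised to, and refuses an NN whose txid is older, 
which covers both Scenario 3 and the stale-fsimage case above.

{code:java}
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch: the DN durably records the last NN txid it promised to.
class PromiseStore {
  private final Path promiseFile;   // assumption: one small file per storage directory

  PromiseStore(Path storageDir) {
    this.promiseFile = storageDir.resolve("last-promised-txid");
  }

  long readLastPromisedTxId() throws IOException {
    if (!Files.exists(promiseFile)) {
      return -1L;   // never promised to any NN before
    }
    return Long.parseLong(Files.readString(promiseFile, StandardCharsets.UTF_8).trim());
  }

  void recordPromise(long nnTxId) throws IOException {
    // A real implementation would make this atomic (write + rename) and fsync it.
    Files.writeString(promiseFile, Long.toString(nnTxId));
  }

  /** Reject registration/commands from an NN whose txid is behind the last promise. */
  boolean acceptNamenode(long nnTxId) throws IOException {
    return nnTxId >= readLastPromisedTxId();
  }
}
{code}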
                
> HA: Datanode fencing mechanism
> ------------------------------
>
>                 Key: HDFS-1972
>                 URL: https://issues.apache.org/jira/browse/HDFS-1972
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: data-node, name-node
>            Reporter: Suresh Srinivas
>            Assignee: Todd Lipcon
>
> In a high availability setup, with an active and a standby namenode, there is 
> a possibility of two namenodes sending commands to the datanode. The datanode 
> must honor commands from only the active namenode and reject commands from 
> the standby, to prevent corruption. This invariant must be maintained during 
> failover and in other states such as split brain. This jira addresses the 
> related issues, the design of the solution, and the implementation.
