[ https://issues.apache.org/jira/browse/HDFS-2742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13182907#comment-13182907 ]
Todd Lipcon commented on HDFS-2742:
-----------------------------------

Looking into this has made me aware of a lurking can of worms... The summary of the issue is that we have to make sure the interleaving of block state transitions and messages from the datanode is maintained. The sequence needs to look like:

1. Block is allocated
2. Block reports may arrive with the block in RBW state
3. Block is "completed"
4. Block reports and blockReceived messages may arrive with the block in FINALIZED state

On the active node, this sequence is guaranteed, since we only mark a block as "complete" once the minimum number of replicas has reported FINALIZED. And once any replica reports FINALIZED, we shouldn't see any more RBW replicas, unless they're truly corrupt.

On the standby node, though, the application of the edits is delayed until it reads the shared storage log. So it may receive step 2 and step 4 long before it even knows about the block. The trick is that we need to interleave them into the correct position in the edits stream.

The issue in this JIRA is that, at the tip of the branch today, we process all queued messages after applying all the edits. So, if we received a block report with an RBW replica, it will be processed after the replica has already been completed, thus swapping steps 2 and 3 in the above sequence. This results in the block being marked as corrupt.

If instead we try to process the queued messages as soon as we first hear about the block, we have the opposite problem: steps 3 and 4 are switched. This causes problems for Safe Mode, since it doesn't properly account for the number of complete blocks in that case. Hence the patch currently attached to this JIRA breaks TestHASafeMode.

I spent the afternoon thinking about it, and the best solution I can come up with is the following:

- Rather than a single PendingDatanodeMessages queue, where we queue up the entire block report or blockReceived message, we should make the queueing more fine-grained. So, if we receive a block report, we can "open it up" and handle each block separately. For each block, we have a few cases (see the sketch below):
-- *Correct state*: the replica has the right genstamp and a consistent state - e.g. an RBW replica for an in-progress block, or a FINALIZED replica for a completed block. We can handle these immediately.
-- *Too-high genstamp*: the replica being reported has a higher generation stamp than what we think is current for the block. Queue it.
-- *Correct genstamp, wrong state*: e.g. a FINALIZED replica for an incomplete block. Queue it.
- When replaying edits, check the queue whenever (a) a new block is created, (b) a block's genstamp is updated, or (c) a block's completion state is changed:
-- If the block has just become complete, process any FINALIZED reports.
-- If the block has just been allocated or gen-stamp-bumped, process any RBW reports.
- During a failover, after we have completely caught up our namespace state, process all pending messages regardless of whether they are "consistent".

This is kind of complicated, but I can't think of much better. The one nice advantage it brings is that we don't have to delay a large BR full of old blocks just because it happens to include one new block. This should keep the standby "hotter" and avoid using a bunch of memory for queued messages.
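To make the proposal concrete, here is a rough, self-contained sketch of the per-block triage and the replay-time drain. It is only an illustration of the idea under discussion, not a patch: every name in it (BlockMessageTriage, StoredBlockView, onBlockEdit, and so on) is made up for the sketch, not an actual HDFS class, and a real implementation would hand replicas off to the existing BlockManager paths rather than a stubbed-out apply().

{code:java}
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

/**
 * Sketch of fine-grained queueing of datanode messages on the standby.
 * All type and method names here are illustrative stand-ins.
 */
class BlockMessageTriage {
  enum ReplicaState { RBW, FINALIZED }
  enum Action { PROCESS_NOW, QUEUE }

  /** Minimal view of what the standby's namespace knows about a block. */
  static class StoredBlockView {
    final long genStamp;
    final boolean complete; // has the block been marked "complete"?
    StoredBlockView(long genStamp, boolean complete) {
      this.genStamp = genStamp;
      this.complete = complete;
    }
  }

  /** A single reported replica that had to be deferred. */
  static class QueuedReplica {
    final long genStamp;
    final ReplicaState state;
    QueuedReplica(long genStamp, ReplicaState state) {
      this.genStamp = genStamp;
      this.state = state;
    }
  }

  /** Deferred messages keyed by block ID, not whole block reports. */
  private final Map<Long, Deque<QueuedReplica>> pending = new HashMap<>();

  /** The three cases from the list above, for one reported replica. */
  static Action classify(StoredBlockView stored, long reportedGenStamp,
                         ReplicaState reportedState) {
    if (reportedGenStamp > stored.genStamp) {
      // Too-high genstamp: the edit that bumped it hasn't been replayed
      // on the standby yet. Queue just this block's message.
      return Action.QUEUE;
    }
    boolean consistent =
        (stored.complete && reportedState == ReplicaState.FINALIZED)
        || (!stored.complete && reportedState == ReplicaState.RBW);
    // Correct state: handle immediately. Correct genstamp but wrong
    // state (e.g. a FINALIZED replica for an incomplete block): queue
    // it until the corresponding edit is replayed.
    return consistent ? Action.PROCESS_NOW : Action.QUEUE;
  }

  /** Defer a replica message for later (the two "queue it" cases). */
  void enqueue(long blockId, QueuedReplica r) {
    pending.computeIfAbsent(blockId, id -> new ArrayDeque<>()).add(r);
  }

  /**
   * Replay-time hook: called when a block is created, gen-stamp-bumped,
   * or completed. Re-classifies queued messages for that block and
   * applies any that have become consistent.
   */
  void onBlockEdit(long blockId, StoredBlockView nowStored) {
    Deque<QueuedReplica> q = pending.get(blockId);
    if (q == null) {
      return;
    }
    for (Iterator<QueuedReplica> it = q.iterator(); it.hasNext(); ) {
      QueuedReplica r = it.next();
      if (classify(nowStored, r.genStamp, r.state) == Action.PROCESS_NOW) {
        it.remove();
        apply(r); // hand off to normal block-report processing
      }
    }
    if (q.isEmpty()) {
      pending.remove(blockId);
    }
  }

  private void apply(QueuedReplica r) {
    // Stub: real code would update the blocks map / corrupt-replica
    // tracking through the usual report-processing paths.
  }
}
{code}

The failover case from the last bullet would just be a variant of onBlockEdit that drains every queue unconditionally, once the namespace has fully caught up.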
> HA: observed dataloss in replication stress test
> ------------------------------------------------
>
>                 Key: HDFS-2742
>                 URL: https://issues.apache.org/jira/browse/HDFS-2742
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: data-node, ha, name-node
>    Affects Versions: HA branch (HDFS-1623)
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>            Priority: Blocker
>         Attachments: hdfs-2742.txt, log-colorized.txt
>
>
> The replication stress test case failed over the weekend since one of the replicas went missing. Still diagnosing the issue, but it seems like the chain of events was something like:
> - a block report was generated on one of the nodes while the block was being written - thus the block report listed the block as RBW
> - when the standby replayed this queued message, it was replayed after the file was marked complete. Thus it marked this replica as corrupt
> - it asked the DN holding the corrupt replica to delete it. And, I think, removed it from the block map at this time.
> - That DN then did another block report before receiving the deletion. This caused the replica to be re-added to the block map, since it was "FINALIZED" now.
> - Replication was lowered on the file, and it counted the above replica as non-corrupt, and asked for the other replicas to be deleted.
> - All replicas were lost.