[ https://issues.apache.org/jira/browse/HDFS-2742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13182907#comment-13182907 ]

Todd Lipcon commented on HDFS-2742:
-----------------------------------

Looking into this has made me aware of a lurking can of worms... the summary of 
the issue is that we have to preserve the relative ordering of block state 
transitions and the messages arriving from the datanodes. The sequence needs to 
look like this (a small sketch follows the list):

1. Block is allocated
2. Block reports may arrive with the block in RBW state
3. Block is "completed"
4. Block reports and blockReceived messages may arrive with the block in 
FINALIZED state.
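
To make that concrete, something like the following (made-up names, just for 
illustration -- not the real block/replica state types):

    // Illustrative sketch only: these enums stand in for the real block and
    // replica state machinery; they are not the actual HDFS types.
    class ReportOrderingSketch {
      enum BlockState { UNDER_CONSTRUCTION, COMPLETE }
      enum ReplicaState { RBW, FINALIZED }

      // An RBW report is only expected while the block is still under
      // construction (steps 1-2); a FINALIZED report is only expected once
      // the block has been completed (steps 3-4).
      static boolean isConsistent(BlockState block, ReplicaState reported) {
        return (block == BlockState.UNDER_CONSTRUCTION && reported == ReplicaState.RBW)
            || (block == BlockState.COMPLETE && reported == ReplicaState.FINALIZED);
      }
    }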

On the active node, this sequence is guaranteed since we only mark a block as 
"complete" once the minimum number of replicas has reported FINALIZED. And once 
any replica reports FINALIZED, we shouldn't see any more RBW replicas, unless 
they're truly corrupt.
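
Roughly (illustrative only, not the actual NN code):

    // The active NN moves a block to "complete" only after the configured
    // minimum number of replicas has been reported FINALIZED, so step 3 can't
    // happen until enough FINALIZED reports have already arrived.
    class ActiveCompletionRuleSketch {
      static boolean canComplete(int finalizedReplicaCount, int minReplication) {
        return finalizedReplicaCount >= minReplication;
      }
    }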

On the standby node, though, the application of the edits is delayed until it 
reads the shared storage log. So it may receive the messages from steps 2 and 4 
long before it even knows about the block. The trick is that we need to 
interleave them into the correct position in the edits stream.

The issue in this JIRA is that, in the tip of the branch today, we are 
processing all queued messages after applying all the edits. So, if we receive 
a block report with an RBW replica, it gets processed after the block has 
already been completed, thus swapping steps 2 and 3 in the above sequence. This 
results in the block being marked as corrupt.
If instead we try to process the queued messages as soon as we first hear about 
the block, we have the opposite problem -- steps 3 and 4 are switched. This 
causes problems for Safe Mode, since it isn't properly accounting for the 
number of complete blocks in that case. Hence the patch currently attached to 
this JIRA breaks TestHASafeMode.

I spent the afternoon thinking about it, and the best solution I can come up 
with is the following:
- rather than a single PendingDatanodeMessages queue, where we queue up the 
entire block report or blockReceived message, we should make the queueing more 
fine-grained. So, if we receive a block report, we can "open it up" and handle 
each block separately. For each block, we have a few cases (sketched just 
after this list):
-- *correct state*: the replica has the right genstamp and a consistent state - 
eg an RBW replica for an in-progress block or a FINALIZED replica for a 
completed block. We can handle these immediately.
-- *too-high genstamp*: the replica being reported has a higher generation 
stamp than what we think is current for the block. Queue it.
-- *correct genstamp, wrong state*: eg a FINALIZED replica for an incomplete 
block. Queue it.
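
As a rough sketch of that per-block triage (hypothetical names, not the 
existing PendingDatanodeMessages API):

    // Hypothetical per-block triage on the standby, covering the three cases
    // above. Names are made up; this is not the real PendingDatanodeMessages
    // or BlockManager code.
    class ReportTriageSketch {
      enum ReplicaState { RBW, FINALIZED }   // stand-ins for the DN replica states
      enum Action { PROCESS_NOW, QUEUE }

      static Action classify(long reportedGenstamp, ReplicaState reportedState,
                             long storedGenstamp, boolean blockComplete) {
        if (reportedGenstamp > storedGenstamp) {
          // Too-high genstamp: the standby hasn't replayed the edit that
          // bumped the genstamp yet, so queue the message.
          return Action.QUEUE;
        }
        boolean consistent =
            (blockComplete && reportedState == ReplicaState.FINALIZED)
            || (!blockComplete && reportedState == ReplicaState.RBW);
        // Correct state: handle immediately. Correct genstamp but wrong state
        // (e.g. a FINALIZED replica for an incomplete block): queue it until
        // the block's state catches up.
        return consistent ? Action.PROCESS_NOW : Action.QUEUE;
      }
    }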

- When replaying edits, check the queue whenever (a) a new block is created, 
(b) a block's genstamp is updated, or (c) a block's completion state is 
changed (see the sketch below):
-- if the block has just become complete, process any FINALIZED reports
-- if the block has just been allocated or gen-stamp-bumped, process any RBW 
reports
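
A sketch of those replay-side hooks (again hypothetical names; the real wiring 
would go wherever the standby applies the relevant edits):

    // Hypothetical hooks invoked from the standby's edit-replay path; each one
    // drains only the queued messages that have just become consistent.
    class ReplayHooksSketch {
      void onBlockAllocated(long blockId)  { processQueued(blockId, false); }
      void onGenstampBumped(long blockId)  { processQueued(blockId, false); }
      void onBlockCompleted(long blockId)  { processQueued(blockId, true); }

      // Placeholder: re-apply the queued block reports / blockReceived
      // messages for this block that match the expected state (RBW vs. FINALIZED).
      void processQueued(long blockId, boolean expectFinalized) { /* ... */ }
    }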

- During a failover, after we have completely caught up our namespace state, 
process all pending messages regardless of whether they are "consistent".

This is kind of complicated, but I can't think of anything much better. The one 
nice advantage it brings is that we don't have to delay a large BR full of old 
blocks just because it happens to include one new block. This should keep the 
standby "hotter" and avoid using a bunch of memory for queued messages.
                
> HA: observed dataloss in replication stress test
> ------------------------------------------------
>
>                 Key: HDFS-2742
>                 URL: https://issues.apache.org/jira/browse/HDFS-2742
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: data-node, ha, name-node
>    Affects Versions: HA branch (HDFS-1623)
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>            Priority: Blocker
>         Attachments: hdfs-2742.txt, log-colorized.txt
>
>
> The replication stress test case failed over the weekend since one of the 
> replicas went missing. Still diagnosing the issue, but it seems like the 
> chain of events was something like:
> - a block report was generated on one of the nodes while the block was being 
> written - thus the block report listed the block as RBW
> - when the standby replayed this queued message, it did so after the file was 
> marked complete. Thus it marked this replica as corrupt
> - it asked the DN holding the corrupt replica to delete it. And, I think, 
> removed it from the block map at this time.
> - That DN then did another block report before receiving the deletion. This 
> caused it to be re-added to the block map, since it was "FINALIZED" now.
> - Replication was lowered on the file, and it counted the above replica as 
> non-corrupt, and asked for the other replicas to be deleted.
> - All replicas were lost.
