0.20: Block lost when multiple DNs trying to recover it to different genstamps
------------------------------------------------------------------------------

                 Key: HDFS-1260
                 URL: https://issues.apache.org/jira/browse/HDFS-1260
             Project: Hadoop HDFS
          Issue Type: Bug
    Affects Versions: 0.20-append
            Reporter: Todd Lipcon
            Assignee: Todd Lipcon
            Priority: Critical
             Fix For: 0.20-append


Saw this issue on a cluster where some ops people were doing network changes 
without shutting down the DNs first. As a result, recovery got started at 
multiple different DNs at the same time, and a race condition caused a block 
to get permanently stuck in recovery mode. What seems to have happened is the 
following (sketched in code after the list):
- FSDataset.tryUpdateBlock is called with old genstamp 7091 and new genstamp 
7094, while the block in the volumeMap (and on the filesystem) has genstamp 7093
- we find the block file and meta file based on block ID only, without 
comparing the genstamp
- we rename the meta file to the new genstamp _7094
- in updateBlockMap, we compare against oldblock in the volumeMap *without* a 
wildcard GS, so the volumeMap does *not* get updated
- validateBlockMetadata now fails with "blk_7739687463244048122_7094 does not 
exist in blocks map"
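
As a rough illustration (not the actual FSDataset code; the class and fields 
below are simplified stand-ins), the failure boils down to the file rename 
keying on block ID alone while the map update keys on an exact (id, genstamp) 
pair:

{code:java}
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Simplified stand-ins for the real classes; names and fields are
// illustrative only, not the 0.20 FSDataset implementation.
public class GenstampRaceSketch {
  static final class Block {
    final long id;
    final long genstamp;
    Block(long id, long genstamp) { this.id = id; this.genstamp = genstamp; }
    // Exact comparison: id AND genstamp must match (no wildcard GS),
    // mirroring the comparison updateBlockMap does on oldblock.
    @Override public boolean equals(Object o) {
      if (!(o instanceof Block)) return false;
      Block b = (Block) o;
      return id == b.id && genstamp == b.genstamp;
    }
    @Override public int hashCode() { return Objects.hash(id, genstamp); }
  }

  // volumeMap entry and on-disk meta file are both at genstamp 7093.
  static final Map<Block, String> volumeMap = new HashMap<>();
  static long onDiskGenstamp = 7093;

  static void tryUpdateBlock(Block oldBlock, Block newBlock) {
    // Files are located by block ID alone; oldBlock's genstamp (7091)
    // is never checked against what is actually on disk (7093)...
    onDiskGenstamp = newBlock.genstamp;        // meta file renamed to _7094

    // ...but the map update uses an exact-genstamp key, so the 7093
    // entry is never found and never replaced.
    String file = volumeMap.remove(oldBlock);  // miss: map has 7093, key has 7091
    if (file != null) {
      volumeMap.put(newBlock, file);
    }
  }

  public static void main(String[] args) {
    long blockId = 7739687463244048122L;
    volumeMap.put(new Block(blockId, 7093), "blk_..._7093.meta");

    // Recovery arrives with a stale old genstamp (7091), targeting 7094.
    tryUpdateBlock(new Block(blockId, 7091), new Block(blockId, 7094));

    System.out.println("on disk: " + onDiskGenstamp);  // 7094
    System.out.println("in map at 7094? "
        + volumeMap.containsKey(new Block(blockId, 7094)));  // false
  }
}
{code}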

After this point, all future recovery attempts on that node fail in 
getBlockMetaDataInfo: getStoredBlock finds the _7094 genstamp (since the meta 
file got renamed above), and validateBlockMetadata then fails because _7094 
isn't in the volumeMap.
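
Continuing the toy model (again, the method names echo the real ones but the 
bodies are illustrative only), the stuck state looks like this:

{code:java}
import java.io.IOException;

// Illustrative only: why every later recovery attempt keeps failing once
// the meta file genstamp and the volumeMap entry disagree.
public class StuckRecoverySketch {
  static long onDiskGenstamp = 7094;     // meta file was renamed to _7094
  static long volumeMapGenstamp = 7093;  // map entry was never updated

  // getStoredBlock derives the genstamp from the on-disk meta file name,
  // so after the rename it always reports 7094.
  static long getStoredGenstamp() { return onDiskGenstamp; }

  // getBlockMetaDataInfo validates the stored block against the map and
  // throws every time, so recovery never makes progress on this node.
  static void getBlockMetaDataInfo(long blockId) throws IOException {
    long gs = getStoredGenstamp();
    if (gs != volumeMapGenstamp) {
      throw new IOException("blk_" + blockId + "_" + gs
          + " does not exist in blocks map");
    }
  }

  public static void main(String[] args) {
    try {
      getBlockMetaDataInfo(7739687463244048122L);
    } catch (IOException e) {
      System.out.println(e.getMessage());  // thrown on every attempt
    }
  }
}
{code}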

Making a unit test for this is probably going to be difficult, but doable.
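
One possible starting point is a toy regression test against the sketch above 
(not a real FSDataset fixture; a real test would need to drive two recoveries 
with different genstamps through the actual datanode). It pins the invariant a 
fix has to restore: a stale-genstamp update must either be rejected before the 
rename or carried through to the map.

{code:java}
import static org.junit.Assert.assertTrue;
import org.junit.Test;

// Fails against the buggy sketch above; a fix must make it pass either by
// refusing the stale rename or by updating the volumeMap consistently.
public class GenstampRaceTest {
  @Test
  public void staleRecoveryLeavesDiskAndMapConsistent() {
    long blockId = 7739687463244048122L;
    GenstampRaceSketch.volumeMap.clear();
    GenstampRaceSketch.volumeMap.put(
        new GenstampRaceSketch.Block(blockId, 7093), "blk_..._7093.meta");
    GenstampRaceSketch.onDiskGenstamp = 7093;

    // Second recovery arrives with a stale old genstamp (7091 vs 7093).
    GenstampRaceSketch.tryUpdateBlock(
        new GenstampRaceSketch.Block(blockId, 7091),
        new GenstampRaceSketch.Block(blockId, 7094));

    // Invariant: whatever genstamp is on disk must be present in the map.
    boolean diskUntouched = GenstampRaceSketch.onDiskGenstamp == 7093;
    boolean mapUpdated = GenstampRaceSketch.volumeMap.containsKey(
        new GenstampRaceSketch.Block(blockId, 7094));
    assertTrue(diskUntouched || mapUpdated);
  }
}
{code}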
