[jira] [Commented] (HDFS-1260) 0.20: Block lost when multiple DNs trying to recover it to different genstamps

2011-09-09 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13101827#comment-13101827
 ] 

Suresh Srinivas commented on HDFS-1260:
---

+1 for the patch.

> 0.20: Block lost when multiple DNs trying to recover it to different genstamps
> --
>
> Key: HDFS-1260
> URL: https://issues.apache.org/jira/browse/HDFS-1260
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 0.20-append
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Critical
> Fix For: 0.20-append
>
> Attachments: HDFS-1260-20S.3.patch, hdfs-1260.txt, hdfs-1260.txt, 
> simultaneous-recoveries.txt
>
>
> Saw this issue on a cluster where some ops people were doing network changes 
> without shutting down DNs first. So, recovery ended up getting started at 
> multiple different DNs at the same time, and some race condition occurred 
> that caused a block to get permanently stuck in recovery mode. What seems to 
> have happened is the following:
> - FSDataset.tryUpdateBlock called with old genstamp 7091, new genstamp 7094, 
> while the block in the volumeMap (and on filesystem) was genstamp 7093
> - we find the block file and meta file based on block ID only, without 
> comparing gen stamp
> - we rename the meta file to the new genstamp _7094
> - in updateBlockMap, we do comparison in the volumeMap by oldblock *without* 
> wildcard GS, so it does *not* update volumeMap
> - validateBlockMetadata now fails with "blk_7739687463244048122_7094 does not 
> exist in blocks map"
> After this point, all future recovery attempts to that node fail in 
> getBlockMetaDataInfo, since it finds the _7094 gen stamp in getStoredBlock 
> (since the meta file got renamed above) and then fails because _7094 isn't in 
> the volumeMap in validateBlockMetadata.
> Making a unit test for this is probably going to be difficult, but doable.
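The sequence above reduces to an ID-only file rename racing an exact-genstamp map lookup. Below is a minimal sketch of that mismatch, using hypothetical simplified types in place of Block and the volumeMap; it illustrates the failure mode and is not the actual 0.20 FSDataset code.

{code}
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: hypothetical simplified stand-ins for the
// real Block and volumeMap types in 0.20 FSDataset.
class GenstampRaceSketch {
  // Record equality includes the genstamp, which is what makes the
  // exact-match lookup below miss.
  record Block(long id, long genstamp) {}

  static final Map<Block, String> volumeMap = new HashMap<>();

  static void tryUpdateBlock(Block oldBlock, Block newBlock) {
    // The meta file is located by block ID alone and renamed to the new
    // genstamp (_7094), even though the on-disk genstamp is _7093.
    renameMetaFileByIdOnly(oldBlock.id(), newBlock.genstamp());

    // updateBlockMap then looks up oldBlock (_7091) *without* a wildcard
    // genstamp, so the _7093 entry is never found and never updated.
    String info = volumeMap.remove(oldBlock); // null: no exact match
    if (info != null) {
      volumeMap.put(newBlock, info);
    }
    // Disk now says _7094 while the map still says _7093, so every later
    // validateBlockMetadata call fails.
  }

  static void renameMetaFileByIdOnly(long id, long newGenstamp) { /* stub */ }

  public static void main(String[] args) {
    volumeMap.put(new Block(7739687463244048122L, 7093), "volume-0");
    tryUpdateBlock(new Block(7739687463244048122L, 7091),
                   new Block(7739687463244048122L, 7094));
    System.out.println(volumeMap); // still keyed by _7093: stuck in recovery
  }
}
{code}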





[jira] [Commented] (HDFS-1260) 0.20: Block lost when multiple DNs trying to recover it to different genstamps

2011-10-04 Thread Matt Foley (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13120351#comment-13120351
 ] 

Matt Foley commented on HDFS-1260:
--

Todd, do we need this in trunk also?





[jira] [Commented] (HDFS-1260) 0.20: Block lost when multiple DNs trying to recover it to different genstamps

2011-10-04 Thread Todd Lipcon (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13120459#comment-13120459
 ] 

Todd Lipcon commented on HDFS-1260:
---

I don't believe so - I think the new append design in trunk prevents this 
issue. There is an existing open JIRA against trunk about forward-porting all 
append-related test cases to the new trunk implementation to be sure the new 
design doesn't suffer from the same issues.





[jira] Commented: (HDFS-1260) 0.20: Block lost when multiple DNs trying to recover it to different genstamps

2010-06-22 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12881490#action_12881490
 ] 

Todd Lipcon commented on HDFS-1260:
---

To confirm the suspicion above, I had the operator rename the meta block back 
to _7094, and the next recovery attempt succeeded.




[jira] Commented: (HDFS-1260) 0.20: Block lost when multiple DNs trying to recover it to different genstamps

2010-06-22 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12881491#action_12881491
 ] 

Todd Lipcon commented on HDFS-1260:
---

Err... sorry: rename the meta file back to *_7093*.




[jira] Commented: (HDFS-1260) 0.20: Block lost when multiple DNs trying to recover it to different genstamps

2010-06-23 Thread sam rash (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12881985#action_12881985
 ] 

sam rash commented on HDFS-1260:


About the testing: any reason not to use one of the adapters instead of making 
this method public?

{code}
public long nextGenerationStampForBlock(Block block) throws IOException  {
{code}

Sorry, I'm a stickler for visibility/encapsulation when I can be.
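
For context, the adapter approach sam has in mind usually looks like the sketch below: a test-only class compiled into the production package, so tests can reach a package-private method without the production class widening it to public. The class and method placement here are illustrative of the pattern, not a claim about the exact 0.20 test sources.

{code}
package org.apache.hadoop.hdfs.server.namenode;

import java.io.IOException;
import org.apache.hadoop.hdfs.protocol.Block;

// Hypothetical test-only adapter: it shares FSNamesystem's package, so it
// can call a package-private nextGenerationStampForBlock on behalf of
// tests that live elsewhere.
public class NameNodeTestAdapter {
  public static long nextGenerationStampForBlock(FSNamesystem ns, Block block)
      throws IOException {
    return ns.nextGenerationStampForBlock(block);
  }
}
{code}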




[jira] Commented: (HDFS-1260) 0.20: Block lost when multiple DNs trying to recover it to different genstamps

2010-06-23 Thread Nicolas Spiegelberg (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12881986#action_12881986
 ] 

Nicolas Spiegelberg commented on HDFS-1260:
---

+1 on the fix. Good catch.




[jira] Commented: (HDFS-1260) 0.20: Block lost when multiple DNs trying to recover it to different genstamps

2010-06-23 Thread sam rash (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12881988#action_12881988
 ] 

sam rash commented on HDFS-1260:


Oh, other than that, LGTM.





[jira] Commented: (HDFS-1260) 0.20: Block lost when multiple DNs trying to recover it to different genstamps

2010-06-23 Thread sam rash (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12881991#action_12881991
 ] 

sam rash commented on HDFS-1260:


Yea, looks good. At some point, does it make sense to move the DelayAnswer 
class out? It seems generally useful (not for this patch, just thinking ahead).





[jira] Commented: (HDFS-1260) 0.20: Block lost when multiple DNs trying to recover it to different genstamps

2010-06-23 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12881993#action_12881993
 ] 

Todd Lipcon commented on HDFS-1260:
---

Yeah, we could move it to a MockitoUtil class or something. Let's tackle that 
when we forward-port all these tests to trunk (I hope to get to that in July).
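
For readers who haven't seen it, DelayAnswer is the kind of Mockito Answer that parks a stubbed call until the test releases it, which is what lets a test line up two recoveries on the same block deterministically. A rough sketch of the idea, simplified relative to whatever the patch's version does:

{code}
import java.util.concurrent.CountDownLatch;
import org.mockito.invocation.InvocationOnMock;
import org.mockito.stubbing.Answer;

// Rough sketch of a DelayAnswer-style helper, simplified from the patch:
// the stubbed method blocks until the test calls proceed(), so the test
// controls exactly when the delayed call resumes.
class DelayAnswerSketch implements Answer<Object> {
  private final CountDownLatch fireLatch = new CountDownLatch(1); // call arrived
  private final CountDownLatch waitLatch = new CountDownLatch(1); // test says go

  @Override
  public Object answer(InvocationOnMock invocation) throws Throwable {
    fireLatch.countDown();   // tell the test we're inside the call
    waitLatch.await();       // park until the test releases us
    return invocation.callRealMethod();
  }

  void waitForCall() throws InterruptedException { fireLatch.await(); }
  void proceed() { waitLatch.countDown(); }
}
{code}

A test would stub the recovery call with this answer, start recovery A, waitForCall(), kick off recovery B, then proceed() to force the interleaving it wants.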




[jira] Commented: (HDFS-1260) 0.20: Block lost when multiple DNs trying to recover it to different genstamps

2010-06-24 Thread dhruba borthakur (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882089#action_12882089
 ] 

dhruba borthakur commented on HDFS-1260:


This patch looks perfect to me. +1




[jira] Commented: (HDFS-1260) 0.20: Block lost when multiple DNs trying to recover it to different genstamps

2010-08-25 Thread dhruba borthakur (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902652#action_12902652
 ] 

dhruba borthakur commented on HDFS-1260:


> However, when we then call updateBlockMap() it doesn't use a wildcard 
> generation stamp, 

Yes, that seems like a bug to me. Great catch!
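
dhruba's confirmation points at the shape of the fix: the map update has to tolerate whatever genstamp the map actually holds, i.e., match on block ID with a wildcard generation stamp before swapping in the new entry. Here is a sketch of that lookup, reusing the simplified types from the sketch near the top of this thread; again, an illustration rather than the literal patch:

{code}
// Wildcard-genstamp lookup, alongside the earlier GenstampRaceSketch types;
// illustrative only, not the literal patch.
static Block findStoredBlockAnyGenstamp(Map<Block, String> volumeMap, long id) {
  for (Block b : volumeMap.keySet()) {
    if (b.id() == id) {
      return b; // match on block ID only, whatever the genstamp
    }
  }
  return null;
}

static void updateBlockMapWithWildcard(Map<Block, String> volumeMap,
                                       Block oldBlock, Block newBlock) {
  // Find whichever genstamp the map actually holds (_7093 in the incident)
  // instead of requiring an exact match on oldBlock's _7091.
  Block stored = findStoredBlockAnyGenstamp(volumeMap, oldBlock.id());
  if (stored != null) {
    // The map and the renamed meta file now agree on _7094.
    volumeMap.put(newBlock, volumeMap.remove(stored));
  }
}
{code}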


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.