[jira] [Commented] (HDFS-2770) Block reports may mark corrupt blocks pending deletion as non-corrupt

2012-04-01 Thread VinayaKumar B (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13243676#comment-13243676
 ] 

VinayaKumar B commented on HDFS-2770:
-------------------------------------

Hi Todd,

I think corrupt replicas are invalidated only if the number of good replicas is 
greater than or equal to the replication factor. But you said the corrupt 
replica is invalidated immediately.
{code}
// Add this replica to corruptReplicas Map
corruptReplicas.addToCorruptReplicasMap(storedBlock, node, reason);
if (countNodes(storedBlock).liveReplicas() >= inode.getReplication()) {
  // the block is over-replicated so invalidate the replicas immediately
  invalidateBlock(storedBlock, node);
} else if (namesystem.isPopulatingReplQueues()) {
  // add the block to neededReplication
  updateNeededReplications(storedBlock, -1, 0);
}
{code}

If the number of datanodes equals the replication factor, and one replica is 
marked corrupt, then that replica will never be deleted and re-replication also 
won't happen, since every datanode already holds a replica (good or corrupt) 
and there is no spare node to place a new one on.
The same issue applies to HDFS-2932.
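To make the stuck case concrete, the branch above can be reduced to a small decision sketch. The class and method names here are invented for illustration and are not the real BlockManager API:

```java
// Toy model of the quoted branch in markBlockAsCorrupt; decide() and its
// names are invented for illustration, not the real BlockManager API.
public class CorruptReplicaDecision {
    static String decide(int liveReplicas, int replicationFactor) {
        if (liveReplicas >= replicationFactor) {
            // Enough good copies already exist: the corrupt replica can go at once.
            return "INVALIDATE";
        }
        // Otherwise the corrupt replica is kept and the block is queued for
        // re-replication -- which stalls if no spare datanode exists.
        return "QUEUE_FOR_REPLICATION";
    }

    public static void main(String[] args) {
        // Replication factor 2, one of two replicas corrupt -> 1 live < 2.
        System.out.println(decide(1, 2)); // QUEUE_FOR_REPLICATION
        // An extra good copy exists -> 2 live >= 2.
        System.out.println(decide(2, 2)); // INVALIDATE
    }
}
```

In the scenario described above (datanode count equals the replication factor), the first branch can never be taken, so the corrupt replica survives indefinitely.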

> Block reports may mark corrupt blocks pending deletion as non-corrupt
> ---------------------------------------------------------------------
>
> Key: HDFS-2770
> URL: https://issues.apache.org/jira/browse/HDFS-2770
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 0.23.0
>Reporter: Todd Lipcon
>Priority: Critical
>
> It seems like HDFS-900 may have regressed in trunk since it was committed 
> without a regression test. In HDFS-2742 I saw the following sequence of 
> events:
> - A block at replication 2 had one of its replicas marked as corrupt on the NN
> - NN scheduled deletion of that replica in {{invalidateWork}}, and removed it 
> from the block map
> - The DN hosting that block sent a block report, which caused the replica to 
> get re-added to the block map as if it were good
> - The deletion request was passed to the DN and it deleted the block
> - Now we're in a bad state, where the NN temporarily thinks that it has two 
> good replicas, but in fact one of them has been deleted. If we lower 
> replication of this block at this time, the one good remaining replica may be 
> deleted.
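The quoted sequence can be replayed as a toy simulation of how the NN's block map diverges from what the datanodes actually hold. All names here are invented; the two sets merely stand in for the NN's view and the datanodes' disks:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Toy replay of the reported race; not real HDFS code.
public class BlockReportRace {
    // Returns {replicas the NN believes exist, replicas that really exist}.
    static int[] simulate() {
        Set<String> nnView  = new HashSet<>(Arrays.asList("dn1", "dn2"));
        Set<String> reality = new HashSet<>(Arrays.asList("dn1", "dn2"));

        nnView.remove("dn2");   // NN marks dn2's replica corrupt, schedules deletion
        nnView.add("dn2");      // dn2's block report re-adds the replica as if good
        reality.remove("dn2");  // dn2 executes the deletion and drops the block

        return new int[] { nnView.size(), reality.size() };
    }

    public static void main(String[] args) {
        int[] r = simulate();
        System.out.println("NN believes " + r[0] + " replicas exist; only "
                + r[1] + " really does");
    }
}
```

The end state matches the report: the NN counts two good replicas while only one remains, so a subsequent setReplication to 1 could delete the last real copy.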

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-2770) Block reports may mark corrupt blocks pending deletion as non-corrupt

2012-01-08 Thread Todd Lipcon (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13182377#comment-13182377
 ] 

Todd Lipcon commented on HDFS-2770:
-----------------------------------

I believe the issue may be with any place we check:
{code}
// Ignore replicas already scheduled to be removed from the DN
if(invalidateBlocks.contains(dn.getStorageID(), block)) {
{code}
since it ignores the fact that, after the replication monitor thread has run, 
the block is no longer in {{BlockManager.invalidateBlocks}} but instead in that 
DatanodeDescriptor's {{invalidateBlocks}} list.

Maybe someone can remind me why we even have two separate invalidateBlocks 
structures in the first place? (one global map keyed by StorageID and another 
per-datanode list)
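A rough sketch of the two bookkeeping structures and why a containment check against only the global map goes stale once the monitor has migrated a block. The class and method names are invented for illustration, not the real HDFS types:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Invented miniature of the two invalidation structures; not real HDFS types.
public class PendingInvalidation {
    // Global map keyed by storage ID (stands in for BlockManager.invalidateBlocks).
    final Map<String, Set<Long>> global = new HashMap<>();
    // Per-datanode lists (stand in for DatanodeDescriptor's invalidateBlocks).
    final Map<String, Set<Long>> perDatanode = new HashMap<>();

    void schedule(String storageId, long blockId) {
        global.computeIfAbsent(storageId, k -> new HashSet<>()).add(blockId);
    }

    // Replication monitor pass: migrate the block to the datanode's own list.
    void monitorPass(String storageId, long blockId) {
        global.getOrDefault(storageId, Collections.emptySet()).remove(blockId);
        perDatanode.computeIfAbsent(storageId, k -> new HashSet<>()).add(blockId);
    }

    // The kind of check quoted above: consults only the global map, so it
    // answers false for a block the monitor has already migrated.
    boolean pendingGlobalOnly(String storageId, long blockId) {
        return global.getOrDefault(storageId, Collections.emptySet())
                .contains(blockId);
    }

    // A check that also consults the per-datanode list would still see it.
    boolean pendingEither(String storageId, long blockId) {
        return pendingGlobalOnly(storageId, blockId)
                || perDatanode.getOrDefault(storageId, Collections.emptySet())
                        .contains(blockId);
    }
}
```

After monitorPass runs, pendingGlobalOnly reports false for a block that is in fact still pending deletion, which is exactly the window in which a block report can resurrect the replica.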
