[ 
https://issues.apache.org/jira/browse/HDFS-4061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13480523#comment-13480523
 ] 

Jing Zhao commented on HDFS-4061:
---------------------------------

Nicholas, I checked the test output, and my guess is that the test failure is 
caused by the following:

When the NameNode invalidates a block for a datanode D1, it removes the 
datanode-block pair from the blockMap. Before the invalidation request is 
sent to D1, BlockManager#computeDataNodeWork can also start running and 
schedule a replication to D1, so the invalidation and replication requests 
are sent to D1 at the same time. D1 then ignores the replication request 
(throwing a ReplicaAlreadyExistsException) and deletes the replica, so the 
NN never receives the blockreceived msg from D1. The test case then times out 
after 5 min, which is shorter than the timeout of a PendingReplication 
request (usually 5~10 min).
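
To make the interleaving concrete, here is a minimal toy simulation of the 
sequence described above. It is only a sketch: ToyDataNode, simulate() and 
the block id "blk_1" are hypothetical names for illustration and are not the 
real HDFS classes or APIs. It assumes the replication command reaches D1 
before the invalidation command.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class InvalidationRaceSketch {

    /** Toy stand-in for datanode D1 (hypothetical, not the HDFS DataNode). */
    static class ToyDataNode {
        final Set<String> replicas = new HashSet<>();
        final List<String> blockReceivedReports = new ArrayList<>();

        /** Handle a replication request targeting this node. */
        void replicate(String blockId) {
            if (replicas.contains(blockId)) {
                // Stands in for the ReplicaAlreadyExistsException case:
                // the request is ignored and nothing is reported to the NN.
                return;
            }
            replicas.add(blockId);
            blockReceivedReports.add(blockId);
        }

        /** Handle an invalidation request: delete the local replica. */
        void invalidate(String blockId) {
            replicas.remove(blockId);
        }
    }

    /** Returns true if D1 ever reports the block as received. */
    static boolean simulate() {
        ToyDataNode d1 = new ToyDataNode();
        d1.replicas.add("blk_1");

        // NN-side view of which blocks D1 holds (toy blockMap entry).
        Set<String> nnViewOfD1 = new HashSet<>(d1.replicas);
        List<String> pendingInvalidations = new ArrayList<>();

        // 1. NN invalidates the block for D1 and drops the pair from its
        //    map, but the request is only queued, not yet delivered.
        nnViewOfD1.remove("blk_1");
        pendingInvalidations.add("blk_1");

        // 2. The replication scheduler runs before delivery. From the NN's
        //    view D1 no longer holds the block, so D1 is a valid target.
        boolean replicationScheduledToD1 = !nnViewOfD1.contains("blk_1");

        // 3. Both commands reach D1 together (replication first here).
        if (replicationScheduledToD1) {
            d1.replicate("blk_1");
        }
        for (String blockId : pendingInvalidations) {
            d1.invalidate(blockId);
        }

        // D1 ends up with no replica and never sent a report.
        return d1.blockReceivedReports.contains("blk_1");
    }

    public static void main(String[] args) {
        System.out.println("blockReceived sent: " + simulate());
    }
}
```

In this toy model simulate() returns false: D1 has deleted the replica and 
never reports it, so the NN side would be left waiting for the pending 
replication to time out.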

I can file another JIRA to fix the test case if you think this analysis is 
correct.
                
> TestBalancer and TestUnderReplicatedBlocks need timeouts
> --------------------------------------------------------
>
>                 Key: HDFS-4061
>                 URL: https://issues.apache.org/jira/browse/HDFS-4061
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.0.0-alpha
>            Reporter: Eli Collins
>            Assignee: Eli Collins
>             Fix For: 2.0.3-alpha
>
>         Attachments: hdfs-4061.txt
>
>
> Saw TestBalancer and TestUnderReplicatedBlocks time out hard on a Jenkins job 
> recently. Let's annotate the relevant tests with timeouts.
