[ https://issues.apache.org/jira/browse/HDFS-10372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15276794#comment-15276794 ]

Rushabh S Shah commented on HDFS-10372:
---------------------------------------

bq. The test expected that the message in exception on out.close() contains the 
name of failed volume (to which the replica was written) but it contained only 
info about live volume (data2).
When the client asked the namenode for the location of the first block, the 
namenode selected a datanode along with an arbitrary storage on that datanode.
Refer to the {{DataStreamer.locateFollowingBlock(DatanodeInfo[] excludedNodes)}} 
method for more details.
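As a rough illustration of what the client ends up holding (a sketch of the client-side view only, not the actual {{DataStreamer}} internals), the located block carries the datanodes and the storage IDs the namenode chose at allocation time:

{code:java}
// Sketch of the client-side view (assumes the usual LocatedBlock API): the
// storage IDs below are whatever the namenode recorded at allocation time;
// the client is never told if the datanode later writes to another volume.
static void printAllocatedLocations(org.apache.hadoop.hdfs.protocol.LocatedBlock lb) {
  org.apache.hadoop.hdfs.protocol.DatanodeInfo[] locs = lb.getLocations();
  String[] storageIDs = lb.getStorageIDs();
  for (int i = 0; i < locs.length; i++) {
    System.out.println(locs[i].getXferAddr() + " -> storage " + storageIDs[i]);
  }
}
{code}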
When the client started writing to the datanode, the datanode selected a volume 
according to {{RoundRobinVolumeChoosingPolicy}}, and it can pick a storage 
different from the one the namenode recorded in its triplets.
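For illustration, here is a minimal round-robin sketch (not the real {{RoundRobinVolumeChoosingPolicy}}, which does more, e.g. checking that the chosen volume has enough space); the point is only that the choice cycles through the datanode's volumes independently of what the namenode recorded:

{code:java}
// Minimal round-robin sketch (illustrative only, not the HDFS implementation):
// each call hands out the next volume in the list, regardless of which storage
// the namenode associated with the block.
class RoundRobinSketch<V> {
  private int next = 0;

  synchronized V chooseVolume(java.util.List<V> volumes) {
    V chosen = volumes.get(next % volumes.size());
    next = (next + 1) % volumes.size();
    return chosen;
  }
}
{code}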
When the datanode sends an IBR (with RECEIVING_BLOCK), the namenode replaces 
the storage info in its triplets with the storage info the datanode reported.
But this change in storage info is not propagated back to the client, so the 
client still holds stale storage info.
When the client tried to close the file, the datanode threw an exception (since 
the volume had gone bad), but because the client had stale storage info, it 
recorded the exception with the old storage info.
This is why the test was flaky in the first place.
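To make this concrete, here is a rough sketch of the assertion pattern involved (illustrative only, not the exact test code; it assumes {{out}} is the stream being written and {{lb}} is the located block of its first block). Matching only on the datanode's xfer address avoids depending on storage info that may have gone stale on the client:

{code:java}
// Illustrative sketch only, not the exact test code.
import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;
import org.apache.hadoop.hdfs.protocol.LocatedBlock;
import org.apache.hadoop.test.GenericTestUtils;
import static org.junit.Assert.fail;

class CloseAssertionSketch {
  static void assertCloseFailsWithDatanodeAddress(FSDataOutputStream out,
      LocatedBlock lb) throws Exception {
    DatanodeInfo dn = lb.getLocations()[0];
    try {
      out.close();
      fail("Expected close() to fail because the volume went bad");
    } catch (IOException ioe) {
      // Fragile: dn.toString() may include storage info that is stale on the
      // client once the datanode has reported a different storage via IBR.
      //   GenericTestUtils.assertExceptionContains(dn.toString(), ioe);
      // More robust: match only the datanode's transfer address.
      GenericTestUtils.assertExceptionContains(dn.getXferAddr(), ioe);
    }
  }
}
{code}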
On my machine the test finishes within 2 seconds, so the datanode didn't send 
any IBR and the storage info in the namenode was not changed.
But on the Jenkins build machines the test ran for more than 8 seconds, which 
gave the datanode ample time to send an IBR.
[~iwasakims]: I hope this answers your question.



> Fix for failing TestFsDatasetImpl#testCleanShutdownOfVolume
> -----------------------------------------------------------
>
>                 Key: HDFS-10372
>                 URL: https://issues.apache.org/jira/browse/HDFS-10372
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: test
>    Affects Versions: 2.7.3
>            Reporter: Rushabh S Shah
>            Assignee: Rushabh S Shah
>         Attachments: HDFS-10372.patch
>
>
> TestFsDatasetImpl#testCleanShutdownOfVolume fails very often.
> We added more debug information in HDFS-10260 to find out why this test is 
> failing.
> Now I think I know the root cause of failure.
> I thought that {{LocatedBlock#getLocations()}} returns an array of 
> DatanodeInfo, but now I realize that it returns an array of 
> DatanodeStorageInfo (which is a subclass of DatanodeInfo).
> In the test I intended to check whether the exception contains the xfer 
> address of the DatanodeInfo. Since the {{DatanodeInfo#toString()}} method 
> returns the xfer address, I checked whether the exception contains 
> {{DatanodeInfo#toString}} or not.
> But since {{LocatedBlock#getLocations()}} returned an array of 
> DatanodeStorageInfo, its toString() implementation also includes the storage 
> info.


