[ https://issues.apache.org/jira/browse/HDFS-10372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15276794#comment-15276794 ]
Rushabh S Shah commented on HDFS-10372:
---------------------------------------

bq. The test expected that the message in exception on out.close() contains the name of failed volume (to which the replica was written) but it contained only info about live volume (data2).

When the client asked for locations for the first block, the namenode selected a datanode with a random storage within that datanode. Refer to the {{DataStreamer.locateFollowingBlock(DatanodeInfo[] excludedNodes)}} method for more details. When the client started writing to the datanode, the datanode selected a volume according to the RoundRobinVolumeChoosingPolicy policy, which can pick a storage different from the one the namenode stored in its triplets.

When the datanode sends an IBR (with RECEIVING_BLOCK), the namenode replaces the storage info in its triplets with the storage info the datanode reported. But this change is not propagated back to the client, so the client still has stale storage info. When the client tried to close the file, the datanode threw an exception (since the volume had gone bad), but because the client had stale storage info, it saved the exception with the old storage info. This is why the test was flaky in the first place.

On my machine, the test finishes within 2 seconds, so the datanode didn't send any IBR and the storage info was not changed on the namenode. But on the Jenkins build machines, the test ran for more than 8 seconds, which gave the datanode ample time to send an IBR.

[~iwasakims]: I hope this answers your question.
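To make the flakiness concrete, here is a minimal sketch of round-robin volume selection. This is a simplified stand-in, not the actual Hadoop RoundRobinVolumeChoosingPolicy source; the class and method names are illustrative. The point it shows is that the chosen volume depends only on datanode-local state (a rotating index), not on the storage the namenode recorded in its triplets, so the two can disagree.

```java
import java.util.Arrays;
import java.util.List;

// Simplified sketch (assumption: not the real Hadoop implementation).
// The datanode rotates through its volumes independently of whatever
// storage the namenode originally handed to the client.
public class RoundRobinSketch {
    private int curVolume = 0;

    public String chooseVolume(List<String> volumes) {
        // Pick the current volume, then advance the datanode-local index.
        String volume = volumes.get(curVolume);
        curVolume = (curVolume + 1) % volumes.size();
        return volume;
    }

    public static void main(String[] args) {
        RoundRobinSketch policy = new RoundRobinSketch();
        List<String> volumes = Arrays.asList("data1", "data2");
        // Successive replicas land on successive volumes, regardless of
        // which storage the namenode picked when locating the block.
        System.out.println(policy.chooseVolume(volumes)); // data1
        System.out.println(policy.chooseVolume(volumes)); // data2
    }
}
```

Because the namenode only learns the actually-chosen volume from the later IBR, the client's copy of the location stays stale in the meantime.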
> Fix for failing TestFsDatasetImpl#testCleanShutdownOfVolume
> -----------------------------------------------------------
>
>                 Key: HDFS-10372
>                 URL: https://issues.apache.org/jira/browse/HDFS-10372
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: test
>    Affects Versions: 2.7.3
>            Reporter: Rushabh S Shah
>            Assignee: Rushabh S Shah
>         Attachments: HDFS-10372.patch
>
> TestFsDatasetImpl#testCleanShutdownOfVolume fails very often.
> We added more debug information in HDFS-10260 to find out why this test is failing.
> Now I think I know the root cause of the failure.
> I thought that {{LocatedBlock#getLocations()}} returns an array of DatanodeInfo, but now I realize that it returns an array of DatanodeStorageInfo (which is a subclass of DatanodeInfo).
> In the test I intended to check whether the exception contains the xfer address of the DatanodeInfo. Since {{DatanodeInfo#toString()}} returns the xfer address, I checked whether the exception contains {{DatanodeInfo#toString}} or not.
> But since {{LocatedBlock#getLocations()}} returned an array of DatanodeStorageInfo, its toString() implementation includes the storage info.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
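The toString() mismatch described in the issue can be sketched as follows. These are hypothetical stand-in classes (the names mirror HDFS but this is not the real implementation): the base class prints only the xfer address, while the subclass appends storage info, so a contains() check against the full subclass string fails as soon as the storage id in the exception is no longer the stale one the client holds.

```java
// Hypothetical minimal sketch of the toString mismatch; these classes
// are illustrative stand-ins, not the real HDFS types.
class DatanodeInfoSketch {
    final String xferAddr;
    DatanodeInfoSketch(String xferAddr) { this.xferAddr = xferAddr; }
    // Base class: toString() is just the xfer address.
    @Override public String toString() { return xferAddr; }
}

class DatanodeStorageInfoSketch extends DatanodeInfoSketch {
    final String storageId;
    DatanodeStorageInfoSketch(String xferAddr, String storageId) {
        super(xferAddr);
        this.storageId = storageId;
    }
    // Subclass: toString() also carries the storage info, so a substring
    // match only succeeds if the storage id is still current.
    @Override public String toString() { return xferAddr + "[DISK]" + storageId; }
}

public class ToStringMismatch {
    public static void main(String[] args) {
        // The client's stale location still names data1...
        DatanodeInfoSketch loc =
            new DatanodeStorageInfoSketch("127.0.0.1:9866", "data1");
        // ...but the exception raised after the IBR names the live volume.
        String exceptionMsg = "failed on 127.0.0.1:9866[DISK]data2";
        // The xfer address alone would match; the full subclass string does not.
        System.out.println(exceptionMsg.contains(loc.xferAddr));   // true
        System.out.println(exceptionMsg.contains(loc.toString())); // false
    }
}
```

This is why asserting on the full location string made the test sensitive to whether an IBR had arrived before the check ran.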