[ https://issues.apache.org/jira/browse/HDFS-15644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17220765#comment-17220765 ]
Xiaoqiao He commented on HDFS-15644:
------------------------------------

Hi [~weichiu], [~ahussein],
I just noticed that the fix versions include 3.2.2, but the change was actually cherry-picked only to branch-3.2. Should we also cherry-pick it to branch-3.2.2 (the pending release branch)? Thanks.

> Failed volumes can cause DNs to stop block reporting
> ----------------------------------------------------
>
>                 Key: HDFS-15644
>                 URL: https://issues.apache.org/jira/browse/HDFS-15644
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: block placement, datanode
>            Reporter: Ahmed Hussein
>            Assignee: Ahmed Hussein
>            Priority: Major
>              Labels: refactor
>             Fix For: 3.2.2, 3.3.1, 3.4.0, 3.1.5
>
>         Attachments: HDFS-15644-branch-2.10.002.patch, HDFS-15644.001.patch,
> HDFS-15644.002.patch
>
>
> [~daryn] found a corner case where removing failed volumes can cause an NPE in
> [FsDatasetImpl.getBlockReports()|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetImpl.java#L1939].
> +Scenario:+
> * Inside {{DataNode#handleVolumeFailures()}}, removing a failed volume is a
> 2-step process.
> ** First, the volume is removed from the volumes list.
> ** Later, the replicas are scrubbed from the volume map.
> * A concurrent thread generating block reports may read the replica map in
> between and reference a volume ID that no longer exists.
> He made a fix for that and we have been using it on our clusters since
> Hadoop-2.7.
> Code analysis shows that the bug still applies to trunk.
> * The path DataNode#removeVolumes() is safe because the two-step process in
> {{FsDatasetImpl.removeVolumes()}}
> [FsDatasetImpl.java#L577|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetImpl.java#L577]
> is protected by {{datasetWriteLock}}.
> * The path DataNode#handleVolumeFailures() is not safe because the failed
> volume is removed from the list without acquiring
> {{datasetWriteLock}}. [FsVolumeList.java#L239|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsVolumeList.java#L239]
> The race condition can cause the caller of getBlockReports() to throw an NPE if
> the RUR replica refers to a volume that has already been removed
> [FsDatasetImpl.java#L1976|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetImpl.java#L1976].
> {code:java}
> case RUR:
>   ReplicaInfo orig = b.getOriginalReplica();
>   // builders.get() returns null if the volume was already removed from the list
>   builders.get(volStorageID).add(orig);
>   break;
> {code}
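
For readers following the thread, the race described in the issue can be reproduced in miniature. The sketch below is not the HDFS-15644 patch and does not use Hadoop classes; {{BlockReportSketch}}, {{Replica}}, {{reportUnguarded}} and {{reportGuarded}} are hypothetical names chosen for illustration. It simulates the window between the two removal steps deterministically (no threads) and shows one possible defensive approach, a null check on the per-volume builder. Holding the same lock around both removal steps, as the safe path does, would be the other obvious option.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of the block-report grouping loop.
class BlockReportSketch {

    /** Stand-in for a replica: a block id plus the storage ID of its volume. */
    static final class Replica {
        final long blockId;
        final String storageId;

        Replica(long blockId, String storageId) {
            this.blockId = blockId;
            this.storageId = storageId;
        }
    }

    /** Creates one empty "builder" (here just a list) per volume still in the list. */
    private static Map<String, List<Long>> newBuilders(List<String> volumes) {
        Map<String, List<Long>> builders = new HashMap<>();
        for (String storageId : volumes) {
            builders.put(storageId, new ArrayList<>());
        }
        return builders;
    }

    /** Same shape as the quoted snippet: assumes every replica's volume has a builder. */
    static Map<String, List<Long>> reportUnguarded(List<String> volumes, List<Replica> replicas) {
        Map<String, List<Long>> builders = newBuilders(volumes);
        for (Replica r : replicas) {
            // Throws NullPointerException when r.storageId is no longer in the volume list.
            builders.get(r.storageId).add(r.blockId);
        }
        return builders;
    }

    /** Defensive variant: skips replicas whose volume has already been removed. */
    static Map<String, List<Long>> reportGuarded(List<String> volumes, List<Replica> replicas) {
        Map<String, List<Long>> builders = newBuilders(volumes);
        for (Replica r : replicas) {
            List<Long> builder = builders.get(r.storageId);
            if (builder == null) {
                continue; // volume removed, replicas not yet scrubbed
            }
            builder.add(r.blockId);
        }
        return builders;
    }

    public static void main(String[] args) {
        List<String> volumes = new ArrayList<>(Arrays.asList("DS-1", "DS-2"));
        List<Replica> replicas = Arrays.asList(new Replica(1L, "DS-1"), new Replica(2L, "DS-2"));

        // Step 1 of the removal has happened; step 2 (scrubbing replicas) has not.
        volumes.remove("DS-2");

        try {
            reportUnguarded(volumes, replicas);
        } catch (NullPointerException e) {
            System.out.println("unguarded report failed with NPE");
        }
        System.out.println("guarded report: " + reportGuarded(volumes, replicas));
    }
}
```

The guarded variant trades completeness for liveness: replicas on the half-removed volume are left out of that one report, which seems acceptable here because the volume is being dropped anyway.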