[ https://issues.apache.org/jira/browse/HDFS-16804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17617622#comment-17617622 ]

ASF GitHub Bot commented on HDFS-16804:
---------------------------------------

ZanderXu opened a new pull request, #5033:
URL: https://github.com/apache/hadoop/pull/5033

   Jira: [HDFS-16804](https://issues.apache.org/jira/browse/HDFS-16804) 
   
   AddVolume contains a race condition with shutdownBlockPool, causing the ReplicaMap to still contain some blocks belonging to the removed block pool.
   
   And the new volume still contains one unused BlockPoolSlice belonging to the removed block pool, which causes problems such as an incorrect dfsUsed and an incorrect numBlocks for the volume.
   
   Let's review the logic of addVolume and shutdownBlockPool in turn.
   
   AddVolume Logic:
   
   - Step1: Get all namespaceInfos from the blockPoolManager
   - Step2: Create a temporary FsVolumeImpl object
   - Step3: Create BlockPoolSlices according to the namespaceInfos and add them to the temporary FsVolumeImpl object
   - Step4: Scan all blocks of those namespaces on the volume and store them in a temporary ReplicaMap
   - Step5: Activate the temporary FsVolumeImpl created above (under the FsDatasetImpl synchronized lock)
   - Step5.1: Merge all blocks of the temporary ReplicaMap into the global ReplicaMap
   - Step5.2: Add the FsVolumeImpl to the volumes
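   The AddVolume steps above can be sketched with a minimal toy model, assuming plain collections stand in for the real classes (a ReplicaMap is modeled as a Map of bpid to block IDs, a volume's bpSlices as a Map of bpid to block list; the actual logic lives in FsDatasetImpl#addVolume and the names below are illustrative only):

```java
import java.util.*;

// Toy model of the AddVolume flow; not real Hadoop code.
class AddVolumeSketch {
    static final List<String> blockPoolManager = new ArrayList<>(List.of("BP-1", "BP-2"));
    static final Map<String, List<Long>> globalReplicaMap = new HashMap<>();
    static final List<Map<String, List<Long>>> volumes = new ArrayList<>();
    static final Object datasetLock = new Object();

    static void addVolume() {
        // Step1: snapshot the namespaces known to the blockPoolManager
        List<String> nsInfos = new ArrayList<>(blockPoolManager);
        // Step2: temporary volume (bpid -> its BlockPoolSlice, modeled as a block list)
        Map<String, List<Long>> tempVolume = new HashMap<>();
        Map<String, List<Long>> tempReplicaMap = new HashMap<>();
        for (String bpid : nsInfos) {
            // Step3: create a slice per namespace in the snapshot
            tempVolume.put(bpid, new ArrayList<>());
            // Step4: scan on-disk blocks into a temporary ReplicaMap (empty disk here)
            tempReplicaMap.put(bpid, new ArrayList<>());
        }
        // Step5: activate under the dataset lock
        synchronized (datasetLock) {
            globalReplicaMap.putAll(tempReplicaMap); // Step5.1: merge into global map
            volumes.add(tempVolume);                 // Step5.2: publish the volume
        }
    }

    public static void main(String[] args) {
        addVolume();
        System.out.println(globalReplicaMap.keySet().size()); // prints 2
    }
}
```

   Note that only Step5 runs under the lock; Steps 1~4 work on an unguarded snapshot, which is what opens the race window described below.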
   
   ShutdownBlockPool Logic (with the blockPool write lock held):
   
   - Step1: Clean up the blockPool's replicas from the global ReplicaMap
   - Step2: Shut down the block pool on all the volumes
   - Step2.1: Do cleanup operations for the block pool, such as saveReplicas, saveDfsUsed, etc.
   - Step2.2: Remove the blockPool from bpSlices
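   The ShutdownBlockPool steps can be sketched in the same toy model (hypothetical names; the real logic lives in FsDatasetImpl#shutdownBlockPool and FsVolumeImpl#shutdownBlockPool):

```java
import java.util.*;

// Toy model of the ShutdownBlockPool flow; not real Hadoop code.
class ShutdownSketch {
    static final Map<String, List<Long>> globalReplicaMap = new HashMap<>();
    static final List<Map<String, List<Long>>> volumes = new ArrayList<>();
    static {
        // Seed one volume holding one block of pool BP-1
        globalReplicaMap.put("BP-1", new ArrayList<>(List.of(1001L)));
        Map<String, List<Long>> bpSlices = new HashMap<>();
        bpSlices.put("BP-1", new ArrayList<>(List.of(1001L)));
        volumes.add(bpSlices);
    }

    static void shutdownBlockPool(String bpid) {
        // Step1: drop the pool's replicas from the global ReplicaMap
        globalReplicaMap.remove(bpid);
        for (Map<String, List<Long>> bpSlices : volumes) {
            // Step2.1: the real BlockPoolSlice persists its state first
            // (saveReplicas, saveDfsUsed); omitted in this toy model
            // Step2.2: remove the pool's slice from the volume's bpSlices
            bpSlices.remove(bpid);
        }
    }

    public static void main(String[] args) {
        shutdownBlockPool("BP-1");
        System.out.println(globalReplicaMap.containsKey("BP-1")); // prints false
    }
}
```

   Crucially, Step2 only shuts down the pool on volumes that are already published in `volumes`; a volume still being constructed by addVolume is invisible to it.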
   
   The race condition can be reproduced by the following interleaving:
   - AddVolume Step1: Get all namespaceInfos from the blockPoolManager
   - ShutdownBlockPool Step1: Clean up the blockPool's replicas from the global ReplicaMap
   - ShutdownBlockPool Step2: Shut down the block pool on all the volumes
   - AddVolume Steps 2~5
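   The interleaving can be replayed deterministically in the same toy model (hypothetical simplified names; a volume's bpSlices is a Map of bpid to block list):

```java
import java.util.*;

// Deterministic replay of the racy interleaving above; not real Hadoop code.
class RaceRepro {
    static final List<String> blockPoolManager = new ArrayList<>(List.of("BP-1"));
    static final Map<String, List<Long>> globalReplicaMap = new HashMap<>();
    static final List<Map<String, List<Long>>> volumes = new ArrayList<>();

    static void repro() {
        // AddVolume Step1: the snapshot still sees BP-1
        List<String> nsInfos = new ArrayList<>(blockPoolManager);

        // ShutdownBlockPool Steps 1~2: BP-1 is removed everywhere
        blockPoolManager.remove("BP-1");
        globalReplicaMap.remove("BP-1");
        for (Map<String, List<Long>> bpSlices : volumes) bpSlices.remove("BP-1");

        // AddVolume Steps 2~5: build slices and replicas for the stale snapshot
        Map<String, List<Long>> tempVolume = new HashMap<>();
        Map<String, List<Long>> tempReplicaMap = new HashMap<>();
        for (String bpid : nsInfos) {
            tempVolume.put(bpid, new ArrayList<>(List.of(1001L)));     // one scanned block
            tempReplicaMap.put(bpid, new ArrayList<>(List.of(1001L)));
        }
        globalReplicaMap.putAll(tempReplicaMap); // Step5.1
        volumes.add(tempVolume);                 // Step5.2
    }

    public static void main(String[] args) {
        repro();
        // The removed pool's state survives, exactly as described above:
        System.out.println(globalReplicaMap.containsKey("BP-1")); // prints true
    }
}
```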
   
   Actual result:
   - The global ReplicaMap contains some blocks belonging to the removed blockPool
   - The bpSlices of the FsVolumeImpl contains one BlockPoolSlice belonging to the removed blockPool
   
   Expected result:
   - The global ReplicaMap shouldn't contain any blocks belonging to the removed blockPool
   - The bpSlices of any FsVolumeImpl shouldn't contain any BlockPoolSlice belonging to the removed blockPool
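   One possible shape of a fix, sketched here as a hypothesis in the same toy model (not necessarily what the linked PR does), is to re-check the namespace snapshot under the FsDatasetImpl lock in Step5 and drop any pool that was shut down between Step1 and Step5:

```java
import java.util.*;

// Hypothetical mitigation sketch: re-validate the stale snapshot under the
// dataset lock before merging. Toy model, not the actual Hadoop fix.
class ActivateWithRecheck {
    static final List<String> blockPoolManager = new ArrayList<>(List.of("BP-2")); // BP-1 already shut down
    static final Map<String, List<Long>> globalReplicaMap = new HashMap<>();
    static final List<Map<String, List<Long>>> volumes = new ArrayList<>();
    static final Object datasetLock = new Object();

    static void activateVolume(Map<String, List<Long>> tempVolume,
                               Map<String, List<Long>> tempReplicaMap) {
        synchronized (datasetLock) {
            // Keep only pools the blockPoolManager still knows about
            tempVolume.keySet().retainAll(blockPoolManager);
            tempReplicaMap.keySet().retainAll(blockPoolManager);
            globalReplicaMap.putAll(tempReplicaMap); // Step5.1
            volumes.add(tempVolume);                 // Step5.2
        }
    }

    public static void main(String[] args) {
        Map<String, List<Long>> tempVolume = new HashMap<>();
        tempVolume.put("BP-1", new ArrayList<>()); // stale: built from an old snapshot
        tempVolume.put("BP-2", new ArrayList<>());
        activateVolume(tempVolume, new HashMap<>(tempVolume));
        System.out.println(globalReplicaMap.containsKey("BP-1")); // prints false
    }
}
```

   With the re-check, the stale BP-1 slice and replicas are discarded before publication, which matches the expected result above.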




> AddVolume contains a race condition with shutdown block pool
> ------------------------------------------------------------
>
>                 Key: HDFS-16804
>                 URL: https://issues.apache.org/jira/browse/HDFS-16804
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: ZanderXu
>            Assignee: ZanderXu
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
