[ 
https://issues.apache.org/jira/browse/HDFS-17151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HDFS-17151:
----------------------------------
    Labels: pull-request-available  (was: )

> EC: Fix wrong metadata in BlockInfoStriped after recovery
> ---------------------------------------------------------
>
>                 Key: HDFS-17151
>                 URL: https://issues.apache.org/jira/browse/HDFS-17151
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Shuyan Zhang
>            Priority: Major
>              Labels: pull-request-available
>
> When a datanode completes a block recovery, it calls the 
> `commitBlockSynchronization` method to notify NN of the new locations of the 
> block. For an EC block group, NN determines the index of each internal block 
> based on the position of the DatanodeID in the parameter `newtargets`.
> If the internal blocks written by the client don't have contiguous indices, 
> the current datanode code might cause NN to record incorrect block metadata. 
> For simplicity, let's take RS(3,2) as an example. The timeline of the 
> problem is as follows:
> 1. The client plans to write internal blocks with indices [0,1,2,3,4] to 
> datanode [dn0, dn1, dn2, dn3, dn4] respectively. But dn1 is unable to 
> connect, so the client only writes data to the remaining 4 datanodes;
> 2. Client crashes;
> 3. NN fails over;
> 4. Now the content of `uc.getExpectedStorageLocations()` depends entirely 
> on block reports, and it is <dn0, dn2, dn3, dn4>;
> 5. When the lease exceeds its hard limit, NN issues a block recovery command;
> 6. The datanode that receives the recovery command fills 
> `DatanodeID[] newLocs` with [dn0, null, dn2, dn3, dn4];
> 7. The serialization process filters out null values, so the parameters 
> passed to NN become [dn0, dn2, dn3, dn4];
> 8. NN mistakenly believes that dn2 stores an internal block with index 1, dn3 
> stores an internal block with index 2, and so on.
> The above timeline is just an example; other situations can produce the 
> same error, such as a pipeline update on the client side. We should fix 
> this bug.
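
Steps 6 to 8 of the timeline can be illustrated with a minimal, standalone
sketch. It is not HDFS code: the class and method names below are
hypothetical, and plain strings stand in for `DatanodeID`. It only shows how
dropping a null entry shifts every later position, so that an index inferred
from position no longer matches the real internal block index.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in Hadoop.
public class StripedIndexShiftSketch {

    // Stand-in for a serialization step that silently drops null entries.
    static List<String> dropNulls(String[] newLocs) {
        List<String> kept = new ArrayList<>();
        for (String dn : newLocs) {
            if (dn != null) {
                kept.add(dn);
            }
        }
        return kept;
    }

    // Stand-in for a receiver that takes the list position as the block index.
    static Map<String, Integer> inferIndexByPosition(List<String> targets) {
        Map<String, Integer> indexByNode = new LinkedHashMap<>();
        for (int i = 0; i < targets.size(); i++) {
            indexByNode.put(targets.get(i), i);
        }
        return indexByNode;
    }

    // The real index of each node: its slot in the original, sparse array.
    static Map<String, Integer> actualIndices(String[] newLocs) {
        Map<String, Integer> indexByNode = new LinkedHashMap<>();
        for (int i = 0; i < newLocs.length; i++) {
            if (newLocs[i] != null) {
                indexByNode.put(newLocs[i], i);
            }
        }
        return indexByNode;
    }

    public static void main(String[] args) {
        // Step 6: dn1 never received data, so its slot is null.
        String[] newLocs = {"dn0", null, "dn2", "dn3", "dn4"};

        Map<String, Integer> actual = actualIndices(newLocs);
        // Steps 7 and 8: nulls are filtered, then indices are read off positions.
        Map<String, Integer> inferred = inferIndexByPosition(dropNulls(newLocs));

        System.out.println("actual:   " + actual);
        System.out.println("inferred: " + inferred);
    }
}
```

In this sketch the inferred map assigns dn2 to index 1, dn3 to index 2 and
dn4 to index 3, which is the mismatch described in step 8. Any fix has to
preserve the real index across the call instead of relying on list position.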



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
