[
https://issues.apache.org/jira/browse/HDFS-17151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Xiaoqiao He resolved HDFS-17151.
--------------------------------
Fix Version/s: 3.4.0
Hadoop Flags: Reviewed
Resolution: Fixed
> EC: Fix wrong metadata in BlockInfoStriped after recovery
> ---------------------------------------------------------
>
> Key: HDFS-17151
> URL: https://issues.apache.org/jira/browse/HDFS-17151
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Shuyan Zhang
> Assignee: Shuyan Zhang
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.4.0
>
>
> When a datanode completes a block recovery, it calls the
> `commitBlockSynchronization` method to notify the NN of the block's new
> locations. For an EC block group, the NN determines the index of each
> internal block from the position of its DatanodeID in the `newtargets`
> parameter. If the internal blocks written by the client do not have
> contiguous indices, the current datanode code may cause the NN to record
> incorrect block metadata.
> For simplicity, let's take RS (3,2) as an example. The timeline of the
> problem is as follows:
> 1. The client plans to write internal blocks with indices [0,1,2,3,4] to
> datanodes [dn0, dn1, dn2, dn3, dn4] respectively. However, the client cannot
> connect to dn1, so it only writes data to the remaining 4 datanodes;
> 2. Client crashes;
> 3. NN fails over;
> 4. The content of `uc.getExpectedStorageLocations()` now depends entirely
> on block reports, and it is <dn0, dn2, dn3, dn4>;
> 5. When the lease exceeds the hard limit, the NN issues a block recovery
> command;
> 6. The datanode that receives the recovery command fills
> `DatanodeID[] newLocs` with [dn0, null, dn2, dn3, dn4];
> 7. The serialization process filters out null values, so the parameters
> passed to NN become [dn0, dn2, dn3, dn4];
> 8. NN mistakenly believes that dn2 stores an internal block with index 1, dn3
> stores an internal block with index 2, and so on.
> The above timeline is just one example. Other situations can produce the
> same error, such as a pipeline update on the client side. We should fix
> this bug.
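The index shift in steps 6 to 8 can be illustrated with a minimal, standalone Java sketch. This is not Hadoop code: the class `IndexShiftSketch` and the helpers `dropNulls` and `indexByPosition` are hypothetical stand-ins for the serialization step and for the NN's position-based index assignment, and datanodes are represented as plain strings rather than `DatanodeID` objects.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class IndexShiftSketch {

    // Hypothetical stand-in for the serialization step (step 7):
    // null entries are dropped, so positions after a gap move left.
    static List<String> dropNulls(String[] locs) {
        List<String> out = new ArrayList<>();
        for (String loc : locs) {
            if (loc != null) {
                out.add(loc);
            }
        }
        return out;
    }

    // Hypothetical stand-in for the NN side (step 8):
    // the internal block index is taken to be the position in the list.
    static Map<String, Integer> indexByPosition(List<String> targets) {
        Map<String, Integer> indices = new LinkedHashMap<>();
        for (int i = 0; i < targets.size(); i++) {
            indices.put(targets.get(i), i);
        }
        return indices;
    }

    public static void main(String[] args) {
        // Step 6: the array position encodes the real internal block index;
        // dn1 never received data, so slot 1 is null.
        String[] newLocs = {"dn0", null, "dn2", "dn3", "dn4"};

        Map<String, Integer> inferred = indexByPosition(dropNulls(newLocs));

        // dn2 really holds index 2, dn3 holds 3, dn4 holds 4,
        // but every datanode after the gap is now off by one.
        System.out.println(inferred); // prints {dn0=0, dn2=1, dn3=2, dn4=3}
    }
}
```

Any scheme that infers the index purely from list position has this weakness once a gap is removed, which is why the metadata recorded for dn2, dn3, and dn4 ends up wrong in the example.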