[jira] [Updated] (HDFS-17801) EC: Reading support retryCurrentNode to avoid transient errors cause application level failures

farmmamba (Jira) Tue, 01 Jul 2025 03:25:26 -0700


     [ 
https://issues.apache.org/jira/browse/HDFS-17801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


farmmamba updated HDFS-17801:
-----------------------------
    Description: 
Currently, there is no retry policy if an IOException occurs when creating a 
block reader to read an EC file. Consider the following case (using 
RS-6-3-1024k):

The datanodes holding the first 4 to 6 data blocks are very busy at the same 
time, so createBlockReader times out and the read fails. EC reads should 
support a retry mechanism to mitigate such read failures.
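
As an illustration only, the retry idea could look like the minimal sketch 
below. It is not the patch attached to this issue and not an existing Hadoop 
API: BlockReaderRetrySketch, IoSupplier, withRetry, maxAttempts and 
backoffMillis are hypothetical names, and the lambda in main merely simulates 
a createBlockReader call that times out twice.

```java
import java.io.IOException;

/** Hypothetical sketch; none of these names are Hadoop APIs. */
final class BlockReaderRetrySketch {

  /** Stand-in for an operation such as creating a block reader. */
  @FunctionalInterface
  interface IoSupplier<T> {
    T get() throws IOException;
  }

  /**
   * Runs the action up to maxAttempts times, sleeping between failed
   * attempts. Rethrows the last IOException if every attempt fails.
   */
  static <T> T withRetry(IoSupplier<T> action, int maxAttempts,
      long backoffMillis) throws IOException {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be >= 1");
    }
    IOException last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return action.get();
      } catch (IOException e) {
        last = e;
        // Wait before the next attempt, but not after the final one.
        if (attempt < maxAttempts && backoffMillis > 0) {
          try {
            Thread.sleep(backoffMillis);
          } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            throw new IOException("interrupted while waiting to retry", ie);
          }
        }
      }
    }
    throw last;
  }

  public static void main(String[] args) throws IOException {
    int[] calls = {0};
    // Simulates a busy datanode that times out twice before responding.
    String reader = withRetry(() -> {
      if (++calls[0] < 3) {
        throw new IOException("simulated timeout");
      }
      return "reader";
    }, 3, 10L);
    System.out.println(reader + " after " + calls[0] + " attempts");
  }
}
```

A bounded attempt count keeps a persistently unavailable datanode from 
stalling the read indefinitely; the real limits would presumably come from 
client configuration.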

  was:
*Description of PR*
  In the 3-replication read implementation, the retryCurrentNode mechanism 
applies when an IOException occurs.
This is very useful for avoiding application-level failures caused by 
transient errors (e.g. a Datanode may have closed the connection because the 
client was idle for too long).  Please refer to the code below: 
[https://github.com/apache/hadoop/blob/6eae1589aeea9bd9c6885e405bd9be5ef6199df7/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java#L824-L828]

  EC reads should also support this mechanism. 
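
  For readers unfamiliar with the pattern, here is a simplified sketch of the 
retry-current-node idea: on the first IOException the same node is tried once 
more, and only a second failure moves the read to another node. This is not 
the DFSInputStream code linked above; RetryCurrentNodeSketch, NodeReader and 
readWithRetryCurrentNode are hypothetical names, and nodes are plain strings.

```java
import java.io.IOException;
import java.util.List;

/** Hypothetical sketch of the pattern; not the Hadoop implementation. */
final class RetryCurrentNodeSketch {

  /** Stand-in for reading from one datanode. */
  @FunctionalInterface
  interface NodeReader {
    int read(String node) throws IOException;
  }

  /**
   * Tries each node in order. A node gets one extra attempt after its first
   * IOException, so a transient error does not immediately rule it out.
   */
  static int readWithRetryCurrentNode(List<String> nodes, NodeReader reader)
      throws IOException {
    IOException last = null;
    for (String node : nodes) {
      boolean retryCurrentNode = true;
      while (true) {
        try {
          return reader.read(node);
        } catch (IOException e) {
          last = e;
          if (retryCurrentNode) {
            // First failure on this node: retry the same node once.
            retryCurrentNode = false;
          } else {
            // Second failure: give up on this node and try the next one.
            break;
          }
        }
      }
    }
    throw last != null ? last : new IOException("no nodes available");
  }
}
```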

  BTW, this issue is motivated by application failures in our cluster after 
we changed the data from 3-replication to an EC policy.

*How was this patch tested?*
Added a unit test.


> EC: Reading support retryCurrentNode to avoid transient errors cause 
> application level failures
> -----------------------------------------------------------------------------------------------
>
>                 Key: HDFS-17801
>                 URL: https://issues.apache.org/jira/browse/HDFS-17801
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: farmmamba
>            Assignee: farmmamba
>            Priority: Major
>              Labels: pull-request-available
>
> Currently, there is no retry policy if an IOException occurs when creating 
> a block reader to read an EC file. Consider the following case (using 
> RS-6-3-1024k):
> The datanodes holding the first 4 to 6 data blocks are very busy at the 
> same time, so createBlockReader times out and the read fails. EC reads 
> should support a retry mechanism to mitigate such read failures.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
