[ https://issues.apache.org/jira/browse/HDFS-9106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15080549#comment-15080549 ]
Junping Du commented on HDFS-9106:
----------------------------------

Hi [~hitliuyi], [~jingzhao] and [~kihwal], do we think this bug should also be fixed in branch-2.6?

> Transfer failure during pipeline recovery causes permanent write failures
> -------------------------------------------------------------------------
>
>                 Key: HDFS-9106
>                 URL: https://issues.apache.org/jira/browse/HDFS-9106
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Kihwal Lee
>            Assignee: Kihwal Lee
>            Priority: Critical
>             Fix For: 2.7.2
>
>         Attachments: HDFS-9106-poc.patch, HDFS-9106.branch-2.7.patch, HDFS-9106.patch
>
>
> When a new node is added to a write pipeline during flush/sync, if the partial block transfer fails, the write fails permanently, without retrying or continuing with whatever remains in the pipeline.
> The transfer often fails in busy clusters due to timeout. There is no per-packet ACK between the client and the datanode, or between the source and target datanodes. If the total transfer time exceeds the configured timeout + 10 seconds (2 * 5 seconds of slack), the transfer is considered failed. Naturally, the failure rate is higher with bigger block sizes.
> I propose the following changes:
> - The transfer timeout needs to be different from the per-packet timeout.
> - The transfer should be retried if it fails (see the sketch below).
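As a rough illustration of the proposal, here is a minimal sketch of a retry loop driven by a transfer-specific timeout. This is not the actual DFSOutputStream recovery code; the class, the transferBlock() helper, and all timeout/retry constants are hypothetical stand-ins.

{code:java}
import java.io.IOException;

// Sketch only: the real fix would live in the client's pipeline
// recovery path; everything in this class is a hypothetical stand-in.
public abstract class TransferRetrySketch {
  // Per-packet (socket read) timeout. Today the transfer deadline is
  // effectively this value plus 2 * 5 seconds of slack, regardless of
  // how much partial-block data must be copied.
  static final long PACKET_TIMEOUT_MS = 60_000L;

  // Proposed: a separate, larger timeout for the whole partial-block
  // transfer, since a partial block can be far bigger than one packet.
  static final long TRANSFER_TIMEOUT_MS = 300_000L;

  static final int MAX_TRANSFER_RETRIES = 3;

  /** Retry the transfer instead of failing the write permanently. */
  void transferWithRetry(String srcNode, String targetNode)
      throws IOException {
    IOException lastFailure = null;
    for (int attempt = 1; attempt <= MAX_TRANSFER_RETRIES; attempt++) {
      try {
        transferBlock(srcNode, targetNode, TRANSFER_TIMEOUT_MS);
        return; // success: keep writing on the repaired pipeline
      } catch (IOException e) {
        lastFailure = e; // e.g. a timeout on a busy cluster: try again
      }
    }
    throw lastFailure; // give up only after exhausting the retries
  }

  /** Hypothetical helper: copies the partial block from src to target. */
  abstract void transferBlock(String src, String target, long timeoutMs)
      throws IOException;
}
{code}

With a bounded retry and a timeout sized to the block transfer rather than to a single packet, a slow transfer on a busy cluster degrades into a retried transfer instead of a permanent write failure.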