Kihwal Lee created HDFS-9106:
--------------------------------
Summary: Transfer failure during pipeline recovery causes permanent write failures
Key: HDFS-9106
URL: https://issues.apache.org/jira/browse/HDFS-9106
Project: Hadoop HDFS
Issue Type: Bug
Reporter: Kihwal Lee
Priority: Critical
When a new datanode is added to a write pipeline during flush/sync, the partial
block must first be transferred to the new node. If that transfer fails, the
write fails permanently: the client neither retries the transfer nor continues
with the nodes remaining in the pipeline.
On busy clusters the transfer often fails due to a timeout. There is no
per-packet ACK between the client and the datanode, or between the source and
target datanodes, so the entire transfer must complete within a single timeout
window. If the total transfer time exceeds the configured timeout plus 10
seconds (2 * 5 seconds of slack), the transfer is considered failed.
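For illustration, the deadline arithmetic looks roughly like the sketch below.
The constant mirrors the 5-second read-timeout extension slack; the class and
method names are made up for this example, and the 60-second value is assumed
as the configured client socket timeout, not taken from any particular config:

{code:java}
// Illustrative sketch only, not actual Hadoop source.
public class TransferTimeoutSketch {
  // Mirrors the 5 s slack added per datanode involved in the transfer.
  static final int READ_TIMEOUT_EXTENSION_MS = 5_000;

  // Effective deadline for the whole partial-block transfer: the
  // configured socket timeout plus 5 s of slack for each of the two
  // datanodes involved (source and target).
  static int transferDeadlineMillis(int configuredSocketTimeoutMs) {
    return configuredSocketTimeoutMs + 2 * READ_TIMEOUT_EXTENSION_MS;
  }

  public static void main(String[] args) {
    // With an assumed 60 s configured timeout, the entire partial block
    // must be copied within 70 s or the transfer is treated as failed.
    System.out.println(transferDeadlineMillis(60_000)); // prints 70000
  }
}
{code}

A nearly full partial block can legitimately take longer than that to copy on
a loaded node, so this fixed window is easy to exceed.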
I propose the following changes (a combined sketch follows the list):
- The transfer timeout needs to be decoupled from the per-packet timeout.
- The transfer should be retried if it fails.
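Below is a minimal sketch of both changes, assuming some transfer() call that
copies the partial block to the new datanode. The names, the retry count, and
the timeout value are hypothetical, not an actual patch:

{code:java}
import java.io.IOException;

// Illustrative sketch of the proposed behavior, not actual Hadoop source.
public class TransferRetrySketch {

  interface BlockTransfer {
    // timeoutMillis is a dedicated transfer timeout, decoupled from the
    // per-packet socket timeout (proposed change 1).
    void run(int timeoutMillis) throws IOException;
  }

  static void transferWithRetry(BlockTransfer transfer) throws IOException {
    final int maxAttempts = 2;             // retry once on failure (change 2)
    final int transferTimeoutMs = 120_000; // sized for a whole partial block
    IOException lastFailure = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        transfer.run(transferTimeoutMs);
        return; // success: pipeline recovery proceeds with the new node
      } catch (IOException e) {
        lastFailure = e; // e.g., timeout on a busy cluster; try again
      }
    }
    // Rather than failing the write permanently, the caller could also
    // fall back to continuing with the nodes already in the pipeline.
    throw lastFailure;
  }
}
{code}

Keeping the transfer deadline separate means a slow but progressing copy of a
large partial block is not killed by a timeout sized for individual packets,
and a retry absorbs transient congestion on busy clusters.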