[ https://issues.apache.org/jira/browse/HDFS-16601?focusedWorklogId=780277&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-780277 ]
ASF GitHub Bot logged work on HDFS-16601:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 10/Jun/22 10:29
            Start Date: 10/Jun/22 10:29
    Worklog Time Spent: 10m
      Work Description:
Hexiaoqiao commented on PR #4369:
URL: https://github.com/apache/hadoop/pull/4369#issuecomment-1152217871

   Sorry for the unclear comment. I know now that the source node is picked in a round-robin way, and that in the third round it will pick the original node again (no matter whether it is a bad or slow node); of course, the probability of that is tiny. What I actually mean is that it would help the client make many fault-tolerance improvements later if we could distinguish the exception raised during the transfer. Once more, this is not a blocking comment. Thanks again.


Issue Time Tracking
-------------------

    Worklog Id:     (was: 780277)
    Time Spent: 1h 40m  (was: 1.5h)

> Failed to replace a bad datanode on the existing pipeline due to no more good
> datanodes being available to try
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-16601
>                 URL: https://issues.apache.org/jira/browse/HDFS-16601
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: ZanderXu
>            Assignee: ZanderXu
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> In our production environment, we hit a bug with the following stack trace:
> {code:java}
> java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes:
> current=[DatanodeInfoWithStorage[127.0.0.1:59687,DS-b803febc-7b22-4144-9b39-7bf521cdaa8d,DISK],
> DatanodeInfoWithStorage[127.0.0.1:59670,DS-0d652bc2-1784-430d-961f-750f80a290f1,DISK]],
> original=[DatanodeInfoWithStorage[127.0.0.1:59670,DS-0d652bc2-1784-430d-961f-750f80a290f1,DISK],
> DatanodeInfoWithStorage[127.0.0.1:59687,DS-b803febc-7b22-4144-9b39-7bf521cdaa8d,DISK]]).
> The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
>     at org.apache.hadoop.hdfs.DataStreamer.findNewDatanode(DataStreamer.java:1418)
>     at org.apache.hadoop.hdfs.DataStreamer.addDatanode2ExistingPipeline(DataStreamer.java:1478)
>     at org.apache.hadoop.hdfs.DataStreamer.handleDatanodeReplacement(DataStreamer.java:1704)
>     at org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1605)
>     at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1587)
>     at org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1371)
>     at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:674)
> {code}
> The root cause is that DFSClient cannot perceive an exception thrown by
> TransferBlock during PipelineRecovery. If TransferBlock fails, the DFSClient
> retries all datanodes in the cluster and then fails.
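
To illustrate the root cause and the suggestion about distinguishing the transfer exception, here is a minimal, self-contained sketch. It is not the HDFS implementation and not the change made in PR #4369: the class `PipelineRecoverySketch`, the `TransferFailedException` type, the `Transfer` interface, and the node names `dn1` to `dn5` are all hypothetical. It only contrasts a client that cannot tell why adding a node failed (and therefore walks through every candidate) with one that can recognise a transfer failure and retry from another source node.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: none of these types or methods are HDFS classes.
class PipelineRecoverySketch {

    // Assumed marker type so the caller can tell a failed block transfer
    // apart from other I/O errors.
    static class TransferFailedException extends IOException {
        TransferFailedException(String source, String target) {
            super("transfer " + source + " -> " + target + " failed");
        }
    }

    // Stand-in for the block transfer between two datanodes.
    interface Transfer {
        void copy(String source, String target) throws IOException;
    }

    // Blind strategy: every failure is blamed on the new target, so the
    // client walks through all candidates using the same source.
    static String replaceBlind(String source, List<String> candidates,
                               Transfer transfer) throws IOException {
        for (String target : candidates) {
            try {
                transfer.copy(source, target);
                return target;
            } catch (IOException e) {
                // Cause unknown to the caller: exclude the target and move on.
            }
        }
        throw new IOException("no more good datanodes being available to try");
    }

    // Aware strategy: a transfer failure triggers a retry from the other
    // pipeline nodes before the target is given up.
    static String replaceAware(List<String> sources, List<String> candidates,
                               Transfer transfer) throws IOException {
        for (String target : candidates) {
            for (String source : sources) {
                try {
                    transfer.copy(source, target);
                    return target;
                } catch (TransferFailedException e) {
                    // Possibly a bad source: try the next one, same target.
                }
            }
        }
        throw new IOException("no more good datanodes being available to try");
    }

    public static void main(String[] args) throws IOException {
        // Simulated cluster in which dn1 cannot serve transfers.
        Transfer transfer = (source, target) -> {
            if (source.equals("dn1")) {
                throw new TransferFailedException(source, target);
            }
        };
        List<String> candidates = Arrays.asList("dn3", "dn4", "dn5");

        try {
            replaceBlind("dn1", candidates, transfer);
        } catch (IOException e) {
            // prints "blind: no more good datanodes being available to try"
            System.out.println("blind: " + e.getMessage());
        }
        // prints "aware: dn3"
        System.out.println("aware: "
            + replaceAware(Arrays.asList("dn1", "dn2"), candidates, transfer));
    }
}
```

In this simulation the blind strategy exhausts every candidate because the only source it uses is the bad one, which mirrors the "retry all datanodes in the cluster and then fail" behaviour described in the issue. The aware strategy succeeds on the first candidate once it switches to the second source.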