Shangshu Qian created HDFS-17661:
------------------------------------
Summary: BlockRecoveryWorker may have a contention with the
BPServiceActor, causing missing IBRs
Key: HDFS-17661
URL: https://issues.apache.org/jira/browse/HDFS-17661
Project: Hadoop HDFS
Issue Type: Bug
Components: datanode
Reporter: Shangshu Qian
We found that a large number of BlockRecoveryWorker threads may cause
IncrementalBlockReports (IBRs) to be delayed due to IOExceptions. In some
edge cases, the DataNode (DN) may enter a feedback loop.
The feedback loop can happen when the cluster is under high load:
# High load on the DN may trigger an IOException (IOE) in
IncrementalBlockReportManager.sendIBRs(). Under the current implementation, the
IBR is requeued and the IOE is swallowed. Assume the IBRs for some block
deletions are delayed at this point.
# When the DataXceiver transfers a block,
DataNode.transferReplicaForPipelineRecovery() can hit an IOE when it cannot
retrieve a block locally. This can happen when the IBR containing the block
deletion is delayed.
# The IOE in the write pipeline can trigger a pipeline rebuild, which slows down
the client. The client's lease is then more likely to be taken over by
another client or to expire. Either case can trigger a lease recovery, which
includes a block recovery and adds more load to the DN, closing the loop back
to step 1.
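
The requeue-and-swallow behavior described in step 1 can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions, not the HDFS implementation: the class IbrRequeueSketch, the Sender interface, and the method names are hypothetical, and reports are modeled as plain strings. It shows how a failed send leaves the reports pending without the caller ever seeing the exception.

```java
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical model of requeue-and-swallow; not the actual HDFS code. */
public class IbrRequeueSketch {
    /** Hypothetical stand-in for the RPC that delivers reports to the NameNode. */
    public interface Sender {
        void send(List<String> reports) throws IOException;
    }

    private final Deque<String> pending = new ArrayDeque<>();

    public void add(String report) {
        pending.addLast(report);
    }

    public int pendingCount() {
        return pending.size();
    }

    /** Tries to send everything pending; returns true only if the batch was delivered. */
    public boolean sendOnce(Sender sender) {
        List<String> batch = new ArrayList<>(pending);
        pending.clear();
        try {
            sender.send(batch);
            return true;
        } catch (IOException e) {
            // The exception is swallowed: the caller only sees "false".
            // The batch goes back to the front of the queue in its original order.
            for (int i = batch.size() - 1; i >= 0; i--) {
                pending.addFirst(batch.get(i));
            }
            return false;
        }
    }

    public static void main(String[] args) {
        IbrRequeueSketch queue = new IbrRequeueSketch();
        queue.add("deleted blk_1");

        // Simulate an overloaded DN: the send attempt fails.
        Sender failing = reports -> { throw new IOException("simulated overload"); };
        boolean delivered = queue.sendOnce(failing);
        queue.add("received blk_2");
        System.out.println("delivered=" + delivered + ", pending=" + queue.pendingCount());

        // Once the load drops, the backlog is sent in its original order.
        delivered = queue.sendOnce(reports -> System.out.println("sent " + reports));
        System.out.println("delivered=" + delivered + ", pending=" + queue.pendingCount());
    }
}
```

In this model the deletion report for blk_1 stays queued for as long as sends keep failing, which corresponds to the delayed block deletion assumed in steps 1 and 2.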