Shangshu Qian created HDFS-17836:
------------------------------------
Summary: Potential Feedback Loop in High Availability
Key: HDFS-17836
URL: https://issues.apache.org/jira/browse/HDFS-17836
Project: Hadoop HDFS
Issue Type: Bug
Components: ha
Affects Versions: 2.10.2
Reporter: Shangshu Qian
We found a potential feedback loop in Hadoop High Availability. Triggering it
requires at least one active NameNode (ANN) and two standby NameNodes (SNNs).
# The ANN is under high load, which delays edit log flushing. The same high
load causes the ANN to lose its connection.
# The SNNs lose some recent IBRs (incremental block reports) due to known race
conditions (HDFS-14941, HDFS-17453).
# After failover, SNN_1 becomes the new ANN. Because those IBRs were lost, it
believes that some blocks are under-replicated and starts replicating them.
# At some point, a DN (DataNode) sends its FBR (full block report) to the new
ANN. The FBR includes the replicas whose IBRs were lost, so the blocks that were
just re-replicated are now over-replicated.
# The new ANN needs to start processing the over-replicated blocks, further
contributing to the high load.
# The system goes back to step 1 and forms a feedback loop.
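The steps above can be sketched as a toy model to show when the loop sustains itself. This is only an illustration, not HDFS code: the function name, the load units, the threshold, and the assumption that each lost IBR costs one replication plus one excess-replica cleanup are all hypothetical.

```python
def simulate(base_load, threshold, initial_spike, lost_ibrs, cost_per_block, rounds):
    """Toy model of the loop; all parameters are hypothetical, not HDFS settings.

    Returns (loads, failovers): the ANN load seen in each round and the
    number of rounds in which the load exceeded the failover threshold.
    """
    extra = initial_spike  # transient load that triggers step 1
    loads = []
    failovers = 0
    for _ in range(rounds):
        load = base_load + extra
        loads.append(load)
        if load > threshold:
            # Steps 1-3: the ANN fails over and the new ANN re-replicates
            # every block whose IBR was lost.
            failovers += 1
            # Steps 4-5: the FBR reveals those blocks as over-replicated,
            # so each one costs a second unit of work (assumed equal cost).
            extra = 2 * lost_ibrs * cost_per_block
        else:
            extra = 0.0  # no failover, so no redundant block work next round
    return loads, failovers


# Many lost IBRs: the redundant work alone keeps the load above the threshold.
loads_loop, failovers_loop = simulate(90, 100, 20, 10, 1.0, 5)

# Few lost IBRs: the first failover happens, then the load settles.
loads_damped, failovers_damped = simulate(90, 100, 20, 2, 1.0, 5)
```

In this sketch the loop only persists when the work caused by the lost IBRs is itself large enough to push the new ANN past the threshold, which suggests the number of lost IBRs per failover is the quantity to bound.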