Shangshu Qian created HDFS-17836:
------------------------------------

             Summary: Potential Feedback Loop in High Availability
                 Key: HDFS-17836
                 URL: https://issues.apache.org/jira/browse/HDFS-17836
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: ha
    Affects Versions: 2.10.2
            Reporter: Shangshu Qian


We have found a potential feedback loop in Hadoop High Availability. Triggering 
it requires at least one active NameNode (ANN) and two standby NameNodes (SNNs).
 # The ANN is under high load, which delays edit log flushing. The same high 
load causes the ANN to lose its connection. 
 # The SNNs lose some recent incremental block reports (IBRs) due to known race 
conditions (HDFS-14941, HDFS-17453).
 # After failover, SNN_1 becomes the new ANN. Because those IBRs were lost, it 
believes that some blocks are under-replicated and starts replicating them.
 # At some point, a DataNode (DN) sends its full block report (FBR) to the new 
ANN, so the blocks are now seen as over-replicated.
 # The new ANN must then process the over-replicated blocks, which adds further 
to the high load.
 # The system returns to step 1, forming a feedback loop.
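
The loop in the steps above can be illustrated with a toy model. This is only a sketch under strong assumptions and is not HDFS code: every name in it (ToyCluster, simulate, failover_threshold, lost_ibrs_per_failover, cost_per_block) is hypothetical, load is measured in arbitrary integer units, and each failover is assumed to lose a fixed number of IBRs. It shows the condition under which the loop sustains itself: the cleanup work caused by one failover must be large enough to push the new ANN over the threshold again.

```python
from dataclasses import dataclass


@dataclass
class ToyCluster:
    """Hypothetical parameters of the toy model (arbitrary integer units)."""
    base_load: int               # steady client load on the active NameNode
    failover_threshold: int      # load above which the ANN is assumed to fail over
    lost_ibrs_per_failover: int  # IBRs the standby is assumed to miss per failover
    cost_per_block: int          # extra load for handling one over-replicated block


def simulate(cluster: ToyCluster, initial_spike: int, rounds: int) -> list[int]:
    """Return the load seen by the active NameNode at the start of each round."""
    loads = []
    load = cluster.base_load + initial_spike
    for _ in range(rounds):
        loads.append(load)
        if load <= cluster.failover_threshold:
            # No failover this round: the ANN keeps up and load returns to baseline.
            load = cluster.base_load
            continue
        # Steps 1-3: failover; the new ANN missed some IBRs and re-replicates
        # blocks that it wrongly believes are under-replicated.
        spurious_replicas = cluster.lost_ibrs_per_failover
        # Step 4: the FBR arrives and those blocks are seen as over-replicated.
        over_replicated = spurious_replicas
        # Step 5: handling the excess replicas adds to the new ANN's load.
        load = cluster.base_load + over_replicated * cluster.cost_per_block
    return loads


# Cleanup cost large enough to re-trigger failover: the loop never settles.
looping = ToyCluster(base_load=50, failover_threshold=80,
                     lost_ibrs_per_failover=10, cost_per_block=5)
# Cleanup cost small: the initial spike causes one failover, then load recovers.
damped = ToyCluster(base_load=50, failover_threshold=80,
                    lost_ibrs_per_failover=10, cost_per_block=1)

looping_loads = simulate(looping, initial_spike=40, rounds=5)
damped_loads = simulate(damped, initial_spike=40, rounds=5)
```

In this model the loop persists only while base_load + lost_ibrs_per_failover * cost_per_block stays above failover_threshold, which suggests that reducing either the number of lost IBRs or the cost of handling excess replicas would break the cycle. Whether the real system behaves this way depends on details the model leaves out.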



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

