Xing Lin created HDFS-17030:
-------------------------------

             Summary: Limit wait time for getHAServiceState in ObserverReaderProxy
                 Key: HDFS-17030
                 URL: https://issues.apache.org/jira/browse/HDFS-17030
             Project: Hadoop HDFS
          Issue Type: Improvement
          Components: hdfs
    Affects Versions: 3.4.0
            Reporter: Xing Lin


When HA is enabled and a standby NN is not responsive (either because it is 
down or because a heap dump is being taken), the client waits for either 
_socket_connection_timeout * socket_max_retries_on_connection_timeout_ or 
_rpcTimeOut_ before moving on to the next NN. This adds significant 
latency. For clusters at LinkedIn, we set rpcTimeOut to 120 seconds, so a 
request takes more than 2 minutes to complete when we take a heap 
dump on a standby. This has been causing user job failures. 

The proposal is to add a timeout on getHAServiceState() calls in 
ObserverReaderProxy, so that we wait at most that long for an NN to respond 
with its HA state. Once the timeout expires, we move on to the next NN. 
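
A minimal sketch of the idea, not the actual patch: the state probe runs on a 
separate thread and the caller waits on the Future for a bounded time only. 
The names HaStateProbe, State, probe and firstObserver are hypothetical and 
do not exist in Hadoop. Each Callable stands in for a getHAServiceState() RPC 
to one NN, and the timeout value would presumably come from a new 
configuration key.

```java
import java.util.List;
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical illustration only; not part of Hadoop. */
final class HaStateProbe {
    /** Stand-in for the HA states an NN can report (assumption). */
    enum State { ACTIVE, STANDBY, OBSERVER }

    // Daemon threads, so a hung probe cannot keep the JVM alive.
    private static final ExecutorService POOL =
        Executors.newCachedThreadPool(r -> {
            Thread t = new Thread(r, "ha-state-probe");
            t.setDaemon(true);
            return t;
        });

    /**
     * Runs one state probe, waiting at most timeoutMs for the answer.
     * Returns empty if the NN did not answer in time or the call failed.
     */
    static Optional<State> probe(Callable<State> call, long timeoutMs) {
        Future<State> future = POOL.submit(call);
        try {
            return Optional.ofNullable(
                future.get(timeoutMs, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the stuck probe and give up
            return Optional.empty();
        } catch (ExecutionException e) {
            return Optional.empty(); // the probe itself threw
        } catch (InterruptedException e) {
            future.cancel(true);
            Thread.currentThread().interrupt(); // preserve interrupt status
            return Optional.empty();
        }
    }

    /**
     * Returns the index of the first NN that reports OBSERVER within the
     * timeout, or -1 if none does. Unresponsive NNs are skipped.
     */
    static int firstObserver(List<Callable<State>> nns, long timeoutMs) {
        for (int i = 0; i < nns.size(); i++) {
            Optional<State> state = probe(nns.get(i), timeoutMs);
            if (state.isPresent() && state.get() == State.OBSERVER) {
                return i;
            }
        }
        return -1;
    }
}
```

With this shape, an NN that is stuck in a heap dump costs the client at most 
one timeout interval per probe rather than the full rpcTimeOut.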

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
