Xing Lin created HDFS-17030:
-------------------------------

             Summary: Limit wait time for getHAServiceState in ObserverReaderProxy
                 Key: HDFS-17030
                 URL: https://issues.apache.org/jira/browse/HDFS-17030
             Project: Hadoop HDFS
          Issue Type: Improvement
          Components: hdfs
    Affects Versions: 3.4.0
            Reporter: Xing Lin

When HA is enabled and a standby NN is not responsive (either because it is down or because a heap dump is being taken), we wait for either _socket_connection_timeout * socket_max_retries_on_connection_timeout_ or _rpcTimeOut_ before moving on to the next NN. This adds significant latency. For clusters at LinkedIn, we set rpcTimeOut to 120 seconds, so a request takes more than 2 minutes to complete when we take a heap dump at a standby. This has been causing user job failures.

The proposal is to add a timeout on getHAServiceState() calls in ObserverReaderProxy, so that we wait only up to that timeout for an NN to respond with its HA state. Once the timeout passes, we move on to the next NN.
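
To illustrate the proposal, here is a minimal sketch of bounding a getHAServiceState() probe with a timeout and falling through to the next NN. It is not the Hadoop patch: the class HaStateProbeSketch, the NameNodeProbe interface, the HaState enum, the fakeNameNode helper, and the 200 ms timeout are all hypothetical stand-ins, and only java.util.concurrent is used.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class HaStateProbeSketch {
    /** Hypothetical stand-in for an NN's HA state. */
    enum HaState { ACTIVE, STANDBY, OBSERVER }

    /** Hypothetical stand-in for a per-NN proxy; not a Hadoop interface. */
    interface NameNodeProbe {
        String name();
        HaState getHAServiceState() throws Exception;
    }

    /** Fake NN that answers after delayMs, simulating a slow or hung standby. */
    static NameNodeProbe fakeNameNode(String name, HaState state, long delayMs) {
        return new NameNodeProbe() {
            @Override public String name() { return name; }
            @Override public HaState getHAServiceState() throws Exception {
                Thread.sleep(delayMs);
                return state;
            }
        };
    }

    /** Returns the NN's state, or empty if it fails or does not answer in time. */
    static Optional<HaState> probeWithTimeout(ExecutorService pool, NameNodeProbe nn,
                                              long timeoutMs) {
        Callable<HaState> call = nn::getHAServiceState;
        Future<HaState> pending = pool.submit(call);
        try {
            return Optional.of(pending.get(timeoutMs, TimeUnit.MILLISECONDS));
        } catch (TimeoutException | ExecutionException e) {
            pending.cancel(true); // give up on this NN; the caller moves on
            return Optional.empty();
        } catch (InterruptedException e) {
            pending.cancel(true);
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return Optional.empty();
        }
    }

    /** Probes NNs in order and returns the name of the first one in the wanted state. */
    static Optional<String> firstWithState(List<NameNodeProbe> nameNodes, HaState wanted,
                                           long timeoutMs) {
        ExecutorService pool = Executors.newCachedThreadPool(r -> {
            Thread t = new Thread(r, "ha-state-probe");
            t.setDaemon(true); // a stuck probe must not keep the JVM alive
            return t;
        });
        try {
            for (NameNodeProbe nn : nameNodes) {
                Optional<HaState> state = probeWithTimeout(pool, nn, timeoutMs);
                if (state.isPresent() && state.get() == wanted) {
                    return Optional.of(nn.name());
                }
            }
            return Optional.empty();
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        List<NameNodeProbe> nameNodes = Arrays.asList(
            fakeNameNode("nn1", HaState.STANDBY, 5_000), // unresponsive standby
            fakeNameNode("nn2", HaState.OBSERVER, 10));
        System.out.println(firstWithState(nameNodes, HaState.OBSERVER, 200));
    }
}
```

In this sketch the wait per NN is capped by the probe timeout rather than by rpcTimeOut. Note that cancel(true) only interrupts the probing thread. Whether a blocked RPC actually stops on interrupt depends on the RPC layer, so a real client would likely share a bounded pool and read the timeout from configuration.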