Wei-Chiu Chuang created HDFS-15719:
--------------------------------------

             Summary: [Hadoop 3] Both NameNodes can crash simultaneously due to 
the short JN socket timeout
                 Key: HDFS-15719
                 URL: https://issues.apache.org/jira/browse/HDFS-15719
             Project: Hadoop HDFS
          Issue Type: Bug
    Affects Versions: 3.0.0
            Reporter: Wei-Chiu Chuang


In Hadoop 3, we migrated from Jetty 6 to Jetty 9. This was implemented in 
HADOOP-10075.

However, HADOOP-10075 erroneously set the HttpServer2 socket idle timeout too 
low.
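We replaced SelectChannelConnector.setLowResourceMaxIdleTime() with 
ServerConnector.setIdleTimeout(), but the two are not equivalent: the former 
only applied while the connector was low on resources, whereas the latter is 
the idle timeout for every connection.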

Previously, HttpServer2's idle timeout was the Jetty 6 default, which is 200 
seconds. In Hadoop 3, the idle timeout is set to 10 seconds, which is 
unreasonable for JN. If NameNodes try to download a big edit log from 
JournalNodes (say a few hundred MB), the download is likely to exceed 10 
seconds. When that happens, both NNs crash, and there is no workaround 
unless you apply the patch in HADOOP-15696, which adds a config switch for the 
idle timeout. Fortunately, this doesn't happen often.

Proposal: bump the default idle timeout to 200 seconds to match the behavior in 
Jetty 6. (Jetty 9 reduced the default idle timeout to 30 seconds, which is not 
suitable for JN.)
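
To make the comparison between the three timeout values concrete, here is a 
minimal sketch in plain Java. It is not Jetty or Hadoop code. It assumes a 
simplified model in which a connection is closed once any gap between I/O 
events exceeds the idle timeout. The class name, the method name, and the 
15-second stall are hypothetical and chosen only for illustration.

```java
public class IdleTimeoutSketch {

    /**
     * Simplified model (an assumption, not Jetty's implementation):
     * the connection is dropped if any single gap between I/O events
     * is longer than the idle timeout.
     */
    public static boolean wouldTimeOut(long[] gapsMillis, long idleTimeoutMillis) {
        for (long gap : gapsMillis) {
            if (gap > idleTimeoutMillis) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Hypothetical gaps (ms) between I/O events during one edit log
        // download, including a single 15-second stall.
        long[] gaps = {50, 120, 15_000, 80};

        // Timeout values taken from the report above.
        long hadoop3 = 10_000;        // 10 s, current Hadoop 3 setting
        long jetty9Default = 30_000;  // 30 s, Jetty 9 default
        long jetty6Default = 200_000; // 200 s, Jetty 6 default (proposed)

        System.out.println("10 s:  " + wouldTimeOut(gaps, hadoop3));
        System.out.println("30 s:  " + wouldTimeOut(gaps, jetty9Default));
        System.out.println("200 s: " + wouldTimeOut(gaps, jetty6Default));
    }
}
```

Under this model, the hypothetical 15-second stall breaks the connection only 
with the 10-second setting. Longer stalls would also exceed the 30-second 
Jetty 9 default, which is why the proposal goes to 200 seconds.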

Other things to consider:
1. fsck servlet? (I suspect this is related to the socket timeout 
reported in HDFS-7175)
2. webhdfs, httpfs? --> we've also received reports that webhdfs can time out, 
so having a longer timeout makes sense here.
3. kms? will the longer timeout cause more lingering sockets?

Thanks [~zhenshan.wen] for the discussion.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
