Both NN crashes due to JN timeout (Hadoop 3)

Wei-Chiu Chuang Mon, 07 Dec 2020 21:51:33 -0800

Hi community,

I want to share an observation with you.

We have received several case reports in which users sometimes see a
JournalNode (JN) timeout when the NN requests edits from a JN. The end
result is that (both!) NNs crash after the timeout (10 seconds).

It seems to happen only to Hadoop 3 users (CDH6 and HDP3). While
HADOOP-15696 <https://issues.apache.org/jira/browse/HADOOP-15696> offers a
configurable switch to increase hadoop.http.idle_timeout.ms, this looks
like a regression in Hadoop 3, and the NN shouldn't simply crash because a
JN is slightly slow. A 10-second timeout for fetching edits from a JN looks
simply too low to me.
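
For anyone who wants to try the HADOOP-15696 switch as a workaround, a
minimal sketch of the override follows. It assumes the property is picked
up from core-site.xml on the JournalNode and NameNode hosts and that the
daemons are restarted afterwards; the 60000 ms value is only an
illustration, not a tested recommendation.

```xml
<!-- core-site.xml (assumed location) -->
<property>
  <name>hadoop.http.idle_timeout.ms</name>
  <!-- illustrative value: 60 seconds instead of the 10 seconds seen above -->
  <value>60000</value>
</property>
```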

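I believe this is a regression introduced when we updated Jetty from 6 to 9
in Hadoop 3 (HADOOP-10075 <https://issues.apache.org/jira/browse/HADOOP-10075>).
We replaced SelectChannelConnector.setLowResourceMaxIdleTime()
with ServerConnector.setIdleTimeout(), but they aren't the same: per the
Javadoc linked below, the former applies only while the connector is low on
resources, whereas the latter is the idle timeout for every connection.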

http://archive.eclipse.org/jetty/7.0.0.RC0/apidocs/org/eclipse/jetty/server/nio/SelectChannelConnector.html#getLowResourcesMaxIdleTime()

https://www.eclipse.org/jetty/javadoc/9.4.26.v20200117/org/eclipse/jetty/server/AbstractConnector.html#setIdleTimeout(long)

Does anyone know the behavior back in Hadoop 2/Jetty 6? Did it use Jetty's
default idle time, which is 300 seconds?
