[ https://issues.apache.org/jira/browse/YARN-3554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14526775#comment-14526775 ]
Jason Lowe commented on YARN-3554:
----------------------------------

YARN-3518 is a separate concern with different ramifications. We should discuss it there and not mix the two.

bq. set this to a bigger value maybe based on network partition considerations not only for nm restart.

What value do you propose? As pointed out earlier, anything over 10 minutes is pointless since the container allocation expires in that time. Is it common for network partitions to last longer than 3 minutes but less than 10 minutes? If so, we should tune the value for that case. If not, making the value larger just slows recovery.

bq. 3 mins seems dangerous. If RM fails over and the recovery takes several mins, NM may kill all containers; in a production env, it's not expected.

This JIRA configures the amount of time NM clients (i.e., primarily ApplicationMasters, and the RM when launching ApplicationMasters) will try to connect to a particular NM before failing. I'm missing how RM failover leads to a mass killing of containers due to this proposed change. This is not a property used by the NM, so the NM is not going to start killing all containers differently based on an updated value for it. The only case where the RM will use this property is when connecting to NMs to launch AM containers, and it will only do so for NMs that have recently heartbeated. Could you explain how this leads to all containers getting killed on a particular node? (A yarn-site.xml sketch of the setting under discussion follows the quoted description below.)

> Default value for maximum nodemanager connect wait time is too high
> -------------------------------------------------------------------
>
>                 Key: YARN-3554
>                 URL: https://issues.apache.org/jira/browse/YARN-3554
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.6.0
>            Reporter: Jason Lowe
>            Assignee: Naganarasimha G R
>              Labels: newbie
>         Attachments: YARN-3554-20150429-2.patch, YARN-3554.20150429-1.patch
>
>
> The default value for yarn.client.nodemanager-connect.max-wait-ms is 900000
> msec, or 15 minutes, which is far too high. The default container expiry time
> from the RM and the default task timeout in MapReduce are both only 10
> minutes.
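For illustration, a minimal yarn-site.xml override might look like the sketch below. The max-wait property name comes from the issue description above; the 3-minute value is the one debated in this thread, and the retry-interval property name is an assumption based on the usual yarn-default.xml entry, not something quoted in this issue.

{code:xml}
<!-- Sketch of a yarn-site.xml override: keep the total NM connect
     wait below the RM's 10-minute container allocation expiry. -->
<property>
  <!-- Total time an NM client (an AM, or the RM launching an AM)
       keeps retrying a connection to a nodemanager before failing. -->
  <name>yarn.client.nodemanager-connect.max-wait-ms</name>
  <value>180000</value> <!-- 3 minutes, the value debated above -->
</property>
<property>
  <!-- Pause between attempts; assumed from yarn-default.xml,
       not quoted in this thread. -->
  <name>yarn.client.nodemanager-connect.retry-interval-ms</name>
  <value>10000</value> <!-- 10 seconds -->
</property>
{code}

With these values a client would make roughly 180000 / 10000 = 18 connection attempts before giving up, well inside the 10-minute allocation expiry window.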