[ 
https://issues.apache.org/jira/browse/YARN-3554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14526775#comment-14526775
 ] 

Jason Lowe commented on YARN-3554:
----------------------------------

YARN-3518 is a separate concern with different ramifications.  We should 
discuss it there and not mix these two.

bq. set this to a bigger value, maybe based on network partition considerations, 
not only for nm restart.
What value do you propose?  As pointed out earlier, anything over 10 minutes is 
pointless since the container allocation expires in that time.  Is it common 
for network partitions to last longer than 3 minutes but less than 10 minutes?  
If so, we should tune the value for that case.  If not, then making the value 
larger just slows recovery time.
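For concreteness, these are the client-side knobs in play.  A sketch of a 
yarn-site.xml fragment follows; the max-wait value shown is the 3-minute 
proposal, and the retry-interval property name and its 10-second value are 
assumptions to verify against yarn-default.xml, not recommendations:

```xml
<!-- Sketch only: illustrates the proposed 3-minute max wait. -->
<property>
  <!-- Total time NM clients keep retrying a connect before failing. -->
  <name>yarn.client.nodemanager-connect.max-wait-ms</name>
  <value>180000</value>
</property>
<property>
  <!-- Assumed companion property: delay between connect retries. -->
  <name>yarn.client.nodemanager-connect.retry-interval-ms</name>
  <value>10000</value>
</property>
```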

bq. 3 mins seems dangerous.  If the RM fails over and recovery takes several 
mins, the NM may kill all containers; in a production env, that's not expected.

This JIRA is configuring the amount of time NM clients (i.e.: primarily 
ApplicationMasters and the RM when launching ApplicationMasters) will try to 
connect to a particular NM before failing.  I'm missing how RM failover leads 
to a mass killing of containers due to this proposed change.  This is not a 
property used by the NM, so the NM is not going to start killing all containers 
differently based on an updated value for it.  The only case where the RM will 
use this property is when connecting to NMs to launch AM containers, and it 
will only do so for NMs that have recently heartbeated.  Could you explain how 
this leads to all containers getting killed on a particular node?
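To make the tradeoff concrete, here is a minimal sketch of the retry arithmetic.  
This is illustrative only, not the actual NMProxy retry policy, and it assumes a 
fixed retry interval between connect attempts:

```java
// Illustrative sketch: approximates how many connect retries fit in the
// client's wait window, assuming a fixed retry interval.  This is NOT the
// actual Hadoop retry-policy implementation.
public class NmConnectRetryMath {

    /** Approximate number of connect retries that fit in the wait window. */
    static long retriesWithin(long maxWaitMs, long retryIntervalMs) {
        return maxWaitMs / retryIntervalMs;
    }

    public static void main(String[] args) {
        // Current default: 15-minute max wait, assumed 10-second interval.
        System.out.println(retriesWithin(900_000, 10_000)); // 90 retries
        // Proposed default: 3-minute max wait.
        System.out.println(retriesWithin(180_000, 10_000)); // 18 retries
    }
}
```

Even at 3 minutes the client still makes on the order of 18 attempts, which 
covers an NM restart; the 15-minute default just delays the failure report 
well past the point where the container allocation has already expired.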

> Default value for maximum nodemanager connect wait time is too high
> -------------------------------------------------------------------
>
>                 Key: YARN-3554
>                 URL: https://issues.apache.org/jira/browse/YARN-3554
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.6.0
>            Reporter: Jason Lowe
>            Assignee: Naganarasimha G R
>              Labels: newbie
>         Attachments: YARN-3554-20150429-2.patch, YARN-3554.20150429-1.patch
>
>
> The default value for yarn.client.nodemanager-connect.max-wait-ms is 900000 
> msec or 15 minutes, which is way too high.  The default container expiry time 
> from the RM and the default task timeout in MapReduce are both only 10 
> minutes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
