[jira] [Commented] (YARN-2175) Container localization has no timeouts and tasks can be stuck there for a long time

Tao Zhang (JIRA) Mon, 29 Jan 2018 12:49:52 -0800

    [ 
https://issues.apache.org/jira/browse/YARN-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16343997#comment-16343997
 ]


Tao Zhang commented on YARN-2175:
---------------------------------

We're facing this issue too. Localization for some containers may take a long 
time due to underlying HDFS issues, machine network conditions, etc. The AM will 
request more containers when it doesn't see enough available containers (that 
is, containers that have finished localization). However, YARN keeps those 
containers stuck in localization, and there is no good automatic way to kill 
them. A dynamically adjusted timeout for the "localizing" state would help here.

Compared to a "pre-configured" timeout value, it would be better to have a 
"dynamically adjusted" timeout. E.g., we calculate the average localization 
time for the first 50% of the containers of an app, then set *2 * 
avg_localizing_time_of_half_containers* as the timeout threshold for the 
remaining containers. This requires the localization times of all containers. 
Hence the *RM* would be the appropriate component to implement this feature 
(container localization timeout). The AM may not be a good choice, since 
"localization" is a common process in YARN and we don't want to implement this 
feature for each type of ApplicationMaster.
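
To make the calculation above concrete, here is a rough sketch in Python. It is 
only an illustration of the idea, not existing YARN code: the function names, 
the parameter names, and the choice to treat "no timeout yet" as None are all 
hypothetical, and a real implementation would live in the RM in Java.

```python
from statistics import mean
from typing import Dict, List, Optional, Sequence


def localization_timeout(
    finished_secs: Sequence[float],
    total_containers: int,
    sample_fraction: float = 0.5,  # "first 50% of the containers" (assumed default)
    multiplier: float = 2.0,       # "2 * avg" from the proposal above
) -> Optional[float]:
    """Return a timeout in seconds, or None until enough containers have localized.

    finished_secs holds localization durations in completion order.
    """
    needed = max(1, int(total_containers * sample_fraction))
    if len(finished_secs) < needed:
        return None  # not enough samples yet, so no timeout is enforced
    # Average only the first `needed` containers that finished localizing.
    return multiplier * mean(finished_secs[:needed])


def overdue(
    started_at: Dict[str, float], now: float, timeout: Optional[float]
) -> List[str]:
    """List ids of containers whose localization has run longer than the timeout."""
    if timeout is None:
        return []
    return sorted(cid for cid, t0 in started_at.items() if now - t0 > timeout)
```

For example, if an app has 6 containers and the first 3 took 10, 20 and 30 
seconds to localize, the threshold for the remaining 3 would be 2 * 20 = 40 
seconds, and any container still localizing after that would be a candidate 
for being killed and re-requested.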

 

> Container localization has no timeouts and tasks can be stuck there for a 
> long time
> -----------------------------------------------------------------------------------
>
>                 Key: YARN-2175
>                 URL: https://issues.apache.org/jira/browse/YARN-2175
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.4.0
>            Reporter: Anubhav Dhoot
>            Priority: Major
>
> There are no timeouts that can be used to limit the time taken by various 
> container startup operations. Localization, for example, could take a long 
> time, and there is no automated way to kill a task if it's stuck in these 
> states. These may have nothing to do with the task itself and could be an 
> issue within the platform.
> Ideally there should be configurable limits on the time spent in various 
> states within the NodeManager. The RM does not care about most of these, 
> and it's only between the AM and the NM. We can start by making these global 
> configurable defaults, and in the future we can make it fancier by letting 
> the AM override them in the start container request. 
> This jira will be used to limit localization time, and we can open others if 
> we feel we need to limit other operations.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
