[jira] [Commented] (YARN-3803) Application hangs after more then one localization attempt fails on the same NM

2015-06-15 Thread Yuliya Feldman (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14587025#comment-14587025
 ] 

Yuliya Feldman commented on YARN-3803:
--

Changed the priority to Major.

> Application hangs after more then one localization attempt fails on the same 
> NM
> ---
>
> Key: YARN-3803
> URL: https://issues.apache.org/jira/browse/YARN-3803
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Affects Versions: 2.7.0, 2.5.1
> Reporter: Yuliya Feldman
> Assignee: Yuliya Feldman
>
> In a sandbox (single-node) environment with LinuxContainerExecutor, when the 
> first application localization attempt fails, the second attempt cannot 
> proceed, and the application subsequently hangs until the RM kills it as 
> non-responding.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3803) Application hangs after more then one localization attempt fails on the same NM

2015-06-15 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14586731#comment-14586731
 ] 

Karthik Kambatla commented on YARN-3803:


There is a separate issue that causes multiple AMs for the same app to be 
assigned to the same node. So, this could be a pretty serious issue. 



[jira] [Commented] (YARN-3803) Application hangs after more then one localization attempt fails on the same NM

2015-06-15 Thread Yuliya Feldman (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14586675#comment-14586675
 ] 

Yuliya Feldman commented on YARN-3803:
--

[~kasha] It happens only on a single-node cluster (at least in my testing), 
since the second and later AM attempts will run on the same node. That said, I 
was debating whether to make it Major or not. I can change it to Major.

I will post a patch with the fix later today. 



[jira] [Commented] (YARN-3803) Application hangs after more then one localization attempt fails on the same NM

2015-06-15 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14586582#comment-14586582
 ] 

Karthik Kambatla commented on YARN-3803:


This seems like a serious issue. Any reason for marking it Minor? 



[jira] [Commented] (YARN-3803) Application hangs after more then one localization attempt fails on the same NM

2015-06-14 Thread Yuliya Feldman (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14585372#comment-14585372
 ] 

Yuliya Feldman commented on YARN-3803:
--

This situation is easily reproducible by running any M/R job as a user with 
id < 500 on a cluster with a single NM using LinuxContainerExecutor.

So far the only solution I have found is to proceed with localization in 
DuplicateFetchResourceTransition if ref == 0.
This solution does not look very clean in terms of the state transitions, 
but there is otherwise no evidence that the previous container's localization 
failed.

I would appreciate comments/thoughts on this.
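
To make the proposal concrete, here is a minimal sketch of the idea, assuming 
a toy model rather than the real NodeManager classes. The class 
DuplicateFetchFixSketch, its fields, and its methods are hypothetical names 
introduced only for illustration; this is not the actual patch. It shows the 
duplicate-request path falling back to a fresh download when no container is 
waiting on the resource (ref == 0).

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Toy model for illustration only; NOT the NodeManager code and NOT the actual patch.
class DuplicateFetchFixSketch {
    enum State { INIT, DOWNLOADING }

    State state = State.INIT;
    // Containers waiting on the resource; the "ref" count is ref.size().
    final Queue<String> ref = new ArrayDeque<>();
    // How many times a download was actually started.
    int downloadsStarted = 0;

    // Models a REQUEST event for the resource.
    void request(String containerId) {
        if (state == State.INIT) {
            fetch(containerId);           // analogue of FetchResourceTransition
            state = State.DOWNLOADING;
        } else {
            duplicateFetch(containerId);  // analogue of DuplicateFetchResourceTransition
        }
    }

    // Assumed cleanup after a failed attempt: the waiter is dropped,
    // but the state stays DOWNLOADING.
    void releaseAfterFailure(String containerId) {
        ref.remove(containerId);
    }

    private void fetch(String containerId) {
        ref.add(containerId);
        downloadsStarted++;
    }

    private void duplicateFetch(String containerId) {
        if (ref.isEmpty()) {
            // ref == 0: nobody is waiting, so the earlier download is assumed
            // to be gone; proceed with localization again.
            fetch(containerId);
        } else {
            // A download is presumably in flight; just register the waiter.
            ref.add(containerId);
        }
    }

    public static void main(String[] args) {
        DuplicateFetchFixSketch rsrc = new DuplicateFetchFixSketch();
        rsrc.request("container_attempt1");
        rsrc.releaseAfterFailure("container_attempt1");
        rsrc.request("container_attempt2");
        System.out.println("downloads started: " + rsrc.downloadsStarted);
    }
}
```

In this model the second attempt triggers a second download instead of 
waiting forever, while a genuine duplicate request (ref > 0) still only 
registers the waiter.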



[jira] [Commented] (YARN-3803) Application hangs after more then one localization attempt fails on the same NM

2015-06-14 Thread Yuliya Feldman (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14585370#comment-14585370
 ] 

Yuliya Feldman commented on YARN-3803:
--

In the LocalizedResource class, the state machine defines the following transitions:
{code}
// From INIT (ref == 0, awaiting req)
.addTransition(ResourceState.INIT, ResourceState.DOWNLOADING,
ResourceEventType.REQUEST, new FetchResourceTransition())

// From DOWNLOADING (ref > 0, may be localizing)
.addTransition(ResourceState.DOWNLOADING, ResourceState.DOWNLOADING,
ResourceEventType.REQUEST, new DuplicateFetchResourceTransition())
{code}

So it assumes that if both the "from" state and the "to" state are 
_DOWNLOADING_ and the _ResourceEventType_ is _REQUEST_, then the resource is 
already being downloaded, and the transition used is 
_DuplicateFetchResourceTransition_.
The problem is that "ref" is not greater than 0 here, because the resources 
were cleaned up during the first attempt, so we end up in a situation where 
nothing happens until the RM kills the app.
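
The sequence described above can be reproduced with a small toy model. This 
is a sketch under stated assumptions, not the NodeManager implementation: the 
class LocalizationHangSketch and its members are hypothetical, and the 
cleanup step is assumed to drop the waiting container while leaving the state 
at DOWNLOADING.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Toy model for illustration only; this is NOT the NodeManager's LocalizedResource class.
class LocalizationHangSketch {
    enum State { INIT, DOWNLOADING }

    State state = State.INIT;
    // Containers waiting on the resource; the "ref" count is ref.size().
    final Queue<String> ref = new ArrayDeque<>();
    // How many times a download was actually started.
    int downloadsStarted = 0;

    // Models a REQUEST event for the resource.
    void request(String containerId) {
        if (state == State.INIT) {
            // INIT -> DOWNLOADING: analogue of FetchResourceTransition.
            ref.add(containerId);
            downloadsStarted++;
            state = State.DOWNLOADING;
        } else {
            // DOWNLOADING -> DOWNLOADING: analogue of DuplicateFetchResourceTransition.
            // It assumes a download is already running and only records the waiter.
            ref.add(containerId);
        }
    }

    // Assumed cleanup after a failed localization attempt: the waiter is
    // dropped, but the state is left at DOWNLOADING.
    void releaseAfterFailure(String containerId) {
        ref.remove(containerId);
    }

    public static void main(String[] args) {
        LocalizationHangSketch rsrc = new LocalizationHangSketch();
        rsrc.request("container_attempt1");
        rsrc.releaseAfterFailure("container_attempt1");
        System.out.println("after failure: state=" + rsrc.state + ", ref=" + rsrc.ref.size());
        rsrc.request("container_attempt2");
        System.out.println("downloads started: " + rsrc.downloadsStarted);
    }
}
```

In this model the second request is treated as a duplicate even though ref 
was 0 when it arrived, so no new download is started and the second attempt 
has nothing to wait for.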

