[jira] [Commented] (YARN-3803) Application hangs after more then one localization attempt fails on the same NM
[ https://issues.apache.org/jira/browse/YARN-3803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14587025#comment-14587025 ] Yuliya Feldman commented on YARN-3803: -- Changed to Major > Application hangs after more then one localization attempt fails on the same > NM > --- > > Key: YARN-3803 > URL: https://issues.apache.org/jira/browse/YARN-3803 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.7.0, 2.5.1 >Reporter: Yuliya Feldman >Assignee: Yuliya Feldman > > In the sandbox (single node) environment with LinuxContainerExecutor when > first Application Localization attempt fails second attempt can not proceed > and subsequently application hangs until RM kills it as non-responding. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3803) Application hangs after more then one localization attempt fails on the same NM
[ https://issues.apache.org/jira/browse/YARN-3803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14586731#comment-14586731 ] Karthik Kambatla commented on YARN-3803: We have this other issue because of which multiple AMs for the same app get assigned to the same node. So, this could be a pretty serious issue.
[jira] [Commented] (YARN-3803) Application hangs after more then one localization attempt fails on the same NM
[ https://issues.apache.org/jira/browse/YARN-3803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14586675#comment-14586675 ] Yuliya Feldman commented on YARN-3803: [~kasha] It happens only if you have a single node (at least in my testing), since the second and later AM attempts will run on the same node. I was debating whether to make it Major, though, and I can change it to Major. I will post a patch for the fix later today.
[jira] [Commented] (YARN-3803) Application hangs after more then one localization attempt fails on the same NM
[ https://issues.apache.org/jira/browse/YARN-3803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14586582#comment-14586582 ] Karthik Kambatla commented on YARN-3803: This seems like a serious issue. Any reason for marking it Minor?
[jira] [Commented] (YARN-3803) Application hangs after more then one localization attempt fails on the same NM
[ https://issues.apache.org/jira/browse/YARN-3803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14585372#comment-14585372 ] Yuliya Feldman commented on YARN-3803: This situation is easily reproducible by running any M/R job as a user with id < 500 on a cluster with a single NM using LinuxContainerExecutor. So far the only solution I have found is to proceed with localization in DuplicateFetchResourceTransition if ref == 0. This solution does not look very clean in terms of state transitions, but there is otherwise no evidence that the previous container's localization failed. I would appreciate comments/thoughts on this.
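The ref == 0 check proposed above can be sketched roughly as follows. This is a hypothetical stand-in, not the YARN patch: the class `DuplicateFetchGuardSketch`, its method names, and the use of a plain set for "ref" are all assumptions made for illustration, and "dispatching a fetch" is modelled as a counter.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical stand-in for the DOWNLOADING -> DOWNLOADING handler; not YARN code. */
final class DuplicateFetchGuardSketch {
    // Containers currently waiting on the resource ("ref" in the comment).
    private final Set<String> ref = new HashSet<>();
    // Number of times a download was (re)started in this model.
    private int fetchesDispatched = 0;

    /** First REQUEST (INIT -> DOWNLOADING): always starts a download. */
    void onFirstRequest(String containerId) {
        ref.add(containerId);
        fetchesDispatched++;
    }

    /** Assumed cleanup after a failed attempt: all waiting containers are released. */
    void onAttemptFailed() {
        ref.clear();
    }

    /**
     * Duplicate REQUEST while already DOWNLOADING.
     * Returns true if a new download was started because nobody was waiting.
     */
    boolean onDuplicateRequest(String containerId) {
        boolean nobodyWaiting = ref.isEmpty(); // the "ref == 0" check
        ref.add(containerId);
        if (nobodyWaiting) {
            fetchesDispatched++; // proceed with localization again
            return true;
        }
        return false; // a download is assumed to be in progress for the other waiters
    }

    int fetchesDispatched() {
        return fetchesDispatched;
    }

    public static void main(String[] args) {
        DuplicateFetchGuardSketch rsrc = new DuplicateFetchGuardSketch();
        rsrc.onFirstRequest("container_attempt1");
        rsrc.onAttemptFailed();
        boolean restarted = rsrc.onDuplicateRequest("container_attempt2");
        System.out.println("restarted=" + restarted
                + " fetches=" + rsrc.fetchesDispatched());
    }
}
```

The guard relies on an empty "ref" as the only signal that the earlier attempt went away, which is the same indirect evidence the comment describes as not very clean.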
[jira] [Commented] (YARN-3803) Application hangs after more then one localization attempt fails on the same NM
[ https://issues.apache.org/jira/browse/YARN-3803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14585370#comment-14585370 ] Yuliya Feldman commented on YARN-3803:

In the LocalizedResource class, the state machine has the following transitions:

{code}
// From INIT (ref == 0, awaiting req)
.addTransition(ResourceState.INIT, ResourceState.DOWNLOADING,
    ResourceEventType.REQUEST, new FetchResourceTransition())
// From DOWNLOADING (ref > 0, may be localizing)
.addTransition(ResourceState.DOWNLOADING, ResourceState.DOWNLOADING,
    ResourceEventType.REQUEST, new DuplicateFetchResourceTransition())
{code}

So it assumes that if the "from" and "to" states are both _DOWNLOADING_ and the _ResourceEventType_ is _REQUEST_, then the resource is already being downloaded, and the transition used is _DuplicateFetchResourceTransition_. The problem is that "ref" is not greater than 0 here, because the resources were cleaned up during the first attempt, so we end up in a situation where nothing happens until the RM kills the app.
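To make the sequence described above concrete, here is a toy model of the two quoted transitions. It is only a sketch under stated assumptions: `LocalizedResourceHangSketch`, `failAttempt`, `isStuck` and the `downloadInFlight` flag are invented for illustration, and the failure path is assumed to clear the waiters without resetting the state, as the comment describes.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy model of the two quoted transitions; not the actual NodeManager code. */
final class LocalizedResourceHangSketch {
    enum State { INIT, DOWNLOADING }

    State state = State.INIT;
    // Containers waiting on the resource; "ref" in the comment is this set's size.
    final Set<String> ref = new HashSet<>();
    // True only while a (modelled) download is actually running.
    boolean downloadInFlight = false;

    /** Models a REQUEST event arriving from the given container. */
    void request(String containerId) {
        if (state == State.INIT) {
            // INIT -> DOWNLOADING (FetchResourceTransition): start the download.
            downloadInFlight = true;
            state = State.DOWNLOADING;
        }
        // DOWNLOADING -> DOWNLOADING (DuplicateFetchResourceTransition):
        // nothing is started; the requester is only recorded as a waiter.
        ref.add(containerId);
    }

    /** Assumption: a failed attempt drops its waiters and its download, but not the state. */
    void failAttempt() {
        ref.clear();
        downloadInFlight = false;
    }

    /** A container is waiting, yet no download is running on its behalf. */
    boolean isStuck() {
        return state == State.DOWNLOADING && !ref.isEmpty() && !downloadInFlight;
    }

    public static void main(String[] args) {
        LocalizedResourceHangSketch rsrc = new LocalizedResourceHangSketch();
        rsrc.request("container_attempt1"); // first attempt starts the download
        rsrc.failAttempt();                 // localization fails and is cleaned up
        rsrc.request("container_attempt2"); // second attempt is treated as a duplicate
        System.out.println("state=" + rsrc.state + " stuck=" + rsrc.isStuck());
    }
}
```

In this model the second request finds the resource still in DOWNLOADING, so it is handled as a duplicate and simply waits, which mirrors the hang until the RM kills the app.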