[ https://issues.apache.org/jira/browse/YARN-2821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14200904#comment-14200904 ]
Varun Vasudev commented on YARN-2821:
-------------------------------------

The root cause appears to be an unexpected over-allocation. In this case the app master got allocated one more container than it expected and went into an infinite loop in the finish function. With regards to the extra container, it's possible we're seeing a variant of YARN-110. Unfortunately the RM doesn't log asks so we can't tell the sequence of asks that led to the extra allocation.

> Distributed shell app master becomes unresponsive sometimes
> -----------------------------------------------------------
>
> Key: YARN-2821
> URL: https://issues.apache.org/jira/browse/YARN-2821
> Project: Hadoop YARN
> Issue Type: Bug
> Components: applications/distributed-shell
> Affects Versions: 2.5.1
> Reporter: Varun Vasudev
> Assignee: Varun Vasudev
>
> We've noticed that once in a while the distributed shell app master becomes unresponsive and is eventually killed by the RM. snippet of the logs -
> {noformat}
> 14/11/04 18:21:37 INFO distributedshell.ApplicationMaster: appattempt_1415123350094_0017_000001 received 0 previous attempts' running containers on AM registration.
> 14/11/04 18:21:37 INFO distributedshell.ApplicationMaster: Requested container ask: Capability[<memory:10, vCores:1>]Priority[0]
> 14/11/04 18:21:37 INFO distributedshell.ApplicationMaster: Requested container ask: Capability[<memory:10, vCores:1>]Priority[0]
> 14/11/04 18:21:37 INFO distributedshell.ApplicationMaster: Requested container ask: Capability[<memory:10, vCores:1>]Priority[0]
> 14/11/04 18:21:37 INFO distributedshell.ApplicationMaster: Requested container ask: Capability[<memory:10, vCores:1>]Priority[0]
> 14/11/04 18:21:37 INFO distributedshell.ApplicationMaster: Requested container ask: Capability[<memory:10, vCores:1>]Priority[0]
> 14/11/04 18:21:38 INFO impl.AMRMClientImpl: Received new token for : onprem-tez2:45454
> 14/11/04 18:21:38 INFO distributedshell.ApplicationMaster: Got response from RM for container ask, allocatedCnt=1
> 14/11/04 18:21:38 INFO distributedshell.ApplicationMaster: Launching shell command on a new container., containerId=container_1415123350094_0017_01_000002, containerNode=onprem-tez2:45454, containerNodeURI=onprem-tez2:50060, containerResourceMemory1024, containerResourceVirtualCores1
> 14/11/04 18:21:38 INFO distributedshell.ApplicationMaster: Setting up container launch container for containerid=container_1415123350094_0017_01_000002
> 14/11/04 18:21:39 INFO impl.NMClientAsyncImpl: Processing Event EventType: START_CONTAINER for Container container_1415123350094_0017_01_000002
> 14/11/04 18:21:39 INFO impl.ContainerManagementProtocolProxy: Opening proxy : onprem-tez2:45454
> 14/11/04 18:21:39 INFO impl.NMClientAsyncImpl: Processing Event EventType: QUERY_CONTAINER for Container container_1415123350094_0017_01_000002
> 14/11/04 18:21:39 INFO impl.ContainerManagementProtocolProxy: Opening proxy : onprem-tez2:45454
> 14/11/04 18:21:39 INFO impl.AMRMClientImpl: Received new token for : onprem-tez3:45454
> 14/11/04 18:21:39 INFO impl.AMRMClientImpl: Received new token for : onprem-tez4:45454
> 14/11/04 18:21:39 INFO distributedshell.ApplicationMaster: Got response from RM for container ask, allocatedCnt=3
> 14/11/04 18:21:39 INFO distributedshell.ApplicationMaster: Launching shell command on a new container., containerId=container_1415123350094_0017_01_000003, containerNode=onprem-tez2:45454, containerNodeURI=onprem-tez2:50060, containerResourceMemory1024, containerResourceVirtualCores1
> 14/11/04 18:21:39 INFO distributedshell.ApplicationMaster: Launching shell command on a new container., containerId=container_1415123350094_0017_01_000004, containerNode=onprem-tez3:45454, containerNodeURI=onprem-tez3:50060, containerResourceMemory1024, containerResourceVirtualCores1
> 14/11/04 18:21:39 INFO distributedshell.ApplicationMaster: Launching shell command on a new container., containerId=container_1415123350094_0017_01_000005, containerNode=onprem-tez4:45454, containerNodeURI=onprem-tez4:50060, containerResourceMemory1024, containerResourceVirtualCores1
> 14/11/04 18:21:39 INFO distributedshell.ApplicationMaster: Setting up container launch container for containerid=container_1415123350094_0017_01_000003
> 14/11/04 18:21:39 INFO distributedshell.ApplicationMaster: Setting up container launch container for containerid=container_1415123350094_0017_01_000005
> 14/11/04 18:21:39 INFO distributedshell.ApplicationMaster: Setting up container launch container for containerid=container_1415123350094_0017_01_000004
> 14/11/04 18:21:39 INFO impl.NMClientAsyncImpl: Processing Event EventType: START_CONTAINER for Container container_1415123350094_0017_01_000005
> 14/11/04 18:21:39 INFO impl.NMClientAsyncImpl: Processing Event EventType: START_CONTAINER for Container container_1415123350094_0017_01_000003
> 14/11/04 18:21:39 INFO impl.ContainerManagementProtocolProxy: Opening proxy : onprem-tez4:45454
> 14/11/04 18:21:39 INFO impl.ContainerManagementProtocolProxy: Opening proxy : onprem-tez2:45454
> 14/11/04 18:21:39 INFO impl.NMClientAsyncImpl: Processing Event EventType: START_CONTAINER for Container container_1415123350094_0017_01_000004
> 14/11/04 18:21:39 INFO impl.ContainerManagementProtocolProxy: Opening proxy : onprem-tez3:45454
> 14/11/04 18:21:39 INFO impl.NMClientAsyncImpl: Processing Event EventType: QUERY_CONTAINER for Container container_1415123350094_0017_01_000005
> 14/11/04 18:21:39 INFO impl.ContainerManagementProtocolProxy: Opening proxy : onprem-tez4:45454
> 14/11/04 18:21:39 INFO impl.NMClientAsyncImpl: Processing Event EventType: QUERY_CONTAINER for Container container_1415123350094_0017_01_000003
> 14/11/04 18:21:39 INFO impl.ContainerManagementProtocolProxy: Opening proxy : onprem-tez2:45454
> 14/11/04 18:21:39 INFO impl.NMClientAsyncImpl: Processing Event EventType: QUERY_CONTAINER for Container container_1415123350094_0017_01_000004
> 14/11/04 18:21:39 INFO impl.ContainerManagementProtocolProxy: Opening proxy : onprem-tez3:45454
> 14/11/04 18:21:40 INFO distributedshell.ApplicationMaster: Got response from RM for container ask, completedCnt=1
> 14/11/04 18:21:40 INFO distributedshell.ApplicationMaster: appattempt_1415123350094_0017_000001 got container status for containerID=container_1415123350094_0017_01_000002, state=COMPLETE, exitStatus=0, diagnostics=
> 14/11/04 18:21:40 INFO distributedshell.ApplicationMaster: Container completed successfully., containerId=container_1415123350094_0017_01_000002
> 14/11/04 18:21:40 INFO distributedshell.ApplicationMaster: Got response from RM for container ask, allocatedCnt=2
> 14/11/04 18:21:40 INFO distributedshell.ApplicationMaster: Launching shell command on a new container., containerId=container_1415123350094_0017_01_000006, containerNode=onprem-tez2:45454, containerNodeURI=onprem-tez2:50060, containerResourceMemory1024, containerResourceVirtualCores1
> 14/11/04 18:21:40 INFO distributedshell.ApplicationMaster: Launching shell command on a new container., containerId=container_1415123350094_0017_01_000007, containerNode=onprem-tez3:45454, containerNodeURI=onprem-tez3:50060, containerResourceMemory1024, containerResourceVirtualCores1
> 14/11/04 18:21:40 INFO distributedshell.ApplicationMaster: Setting up container launch container for containerid=container_1415123350094_0017_01_000007
> 14/11/04 18:21:40 INFO distributedshell.ApplicationMaster: Setting up container launch container for containerid=container_1415123350094_0017_01_000006
> {noformat}

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
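
The over-allocation mentioned in the comment can be cross-checked against the quoted log excerpt by counting the "Requested container ask" lines and summing the allocatedCnt values. The following is a minimal sketch, not part of the distributed shell code: the `tally` helper and the abbreviated `LOG_LINES` list (message text only, timestamps and class names dropped) are assumptions made for illustration.

```python
import re

# Abbreviated message text taken from the quoted log excerpt
# (timestamps and logger names omitted for brevity).
LOG_LINES = [
    "Requested container ask: Capability[<memory:10, vCores:1>]Priority[0]",
    "Requested container ask: Capability[<memory:10, vCores:1>]Priority[0]",
    "Requested container ask: Capability[<memory:10, vCores:1>]Priority[0]",
    "Requested container ask: Capability[<memory:10, vCores:1>]Priority[0]",
    "Requested container ask: Capability[<memory:10, vCores:1>]Priority[0]",
    "Got response from RM for container ask, allocatedCnt=1",
    "Got response from RM for container ask, allocatedCnt=3",
    "Got response from RM for container ask, completedCnt=1",
    "Got response from RM for container ask, allocatedCnt=2",
]


def tally(lines):
    """Return (requested, allocated) container counts found in the lines."""
    # Each "Requested container ask" line corresponds to one container ask.
    requested = sum("Requested container ask" in line for line in lines)
    # Each RM response reports how many containers it allocated.
    allocated = sum(
        int(match.group(1))
        for line in lines
        for match in re.finditer(r"allocatedCnt=(\d+)", line)
    )
    return requested, allocated


requested, allocated = tally(LOG_LINES)
print(f"requested={requested} allocated={allocated} extra={allocated - requested}")
# prints "requested=5 allocated=6 extra=1"
```

For this excerpt the tally shows five asks and six allocated containers, which matches the one extra container described in the comment. It does not show which ask led to the extra allocation, since the RM side of the exchange is not logged.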