[ https://issues.apache.org/jira/browse/TEZ-2509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568396#comment-14568396 ]
Jeff Zhang commented on TEZ-2509: --------------------------------- Should also stop assign containers in DelayedContainerManager ? > YarnTaskSchedulerService should not try to allocate containers if AM is > shutting down > ------------------------------------------------------------------------------------- > > Key: TEZ-2509 > URL: https://issues.apache.org/jira/browse/TEZ-2509 > Project: Apache Tez > Issue Type: Bug > Reporter: Hitesh Shah > Assignee: Hitesh Shah > Attachments: TEZ-2509.1.patch, TEZ-2509.2.patch > > > Observed when doing some recovery testing: > Failure as during dag shutdown, 4 attempts of the same task failed. > {code} > 2015-06-01 07:38:27,184 INFO [Dispatcher thread: Central] > history.HistoryEventHandler: > [HISTORY][DAG:dag_1433141118424_0012_2][Event:TASK_FINISHED]: > vertexName=initialmap, taskId=task_1433141118424_0012_2_00_000003, > startTime=1433144297281, finishTime=1433144307184, timeTaken=9903, > status=FAILED, successfulAttemptID=null, diagnostics=TaskAttempt 0 failed, > info=[Container container_e02_1433141118424_0012_01_000018 hit an invalid > transition - C_NM_STOP_SENT at RUNNING] > TaskAttempt 1 failed, info=[AttemptId: > attempt_1433141118424_0012_2_00_000003_1 cannot be allocated to container: > container_e02_1433141118424_0012_01_000011 in STOP_REQUESTED state] > TaskAttempt 2 failed, info=[Container > container_e02_1433141118424_0012_01_000012 hit an invalid transition - > C_NM_STOP_SENT at RUNNING] > TaskAttempt 3 failed, info=[Container > container_e02_1433141118424_0012_01_000025 hit an invalid transition - > C_NM_STOP_SENT at RUNNING], counters=Counters: 0 > {code} > > DAG kill signal received. > {code} > 2015-06-01 07:38:25,811 INFO [Thread-3] app.DAGAppMaster: > DAGAppMasterShutdownHook invoked > 2015-06-01 07:38:25,811 INFO [Thread-3] app.DAGAppMaster: DAGAppMaster > received a signal. Signaling TaskScheduler > {code} > First attempt marked as failed as container was killed. > {code} > 2015-06-01 07:38:26,906 INFO [Dispatcher thread: Central] > history.HistoryEventHandler: > [HISTORY][DAG:dag_1433141118424_0012_2][Event:TASK_ATTEMPT_FINISHED]: > vertexName=initialmap, > taskAttemptId=attempt_1433141118424_0012_2_00_000003_0, > startTime=1433144297281, finishTime=1433144306904, timeTaken=9623, > status=FAILED, errorEnum=FRAMEWORK_ERROR, diagnostics=Container > container_e02_1433141118424_0012_01_000018 hit an invalid transition - > C_NM_STOP_SENT at RUNNING, counters=Counters: 0 > {code} > Subsequent attempt scheduled, assigned and eventually fails. > {code} > 2015-06-01 07:38:26,919 INFO [DelayedContainerManager] > rm.YarnTaskSchedulerService: Assigning container to task, > container=Container: [ContainerId: > container_e02_1433141118424_0012_01_000011, NodeId: > ip-172-31-18-41.ec2.internal:45454, NodeHttpAddress: > ip-172-31-18-41.ec2.internal:8042, Resource: <memory:1536, vCores:1>, > Priority: 2, Token: Token { kind: ContainerToken, service: 172.31.18.41:45454 > }, ], task=attempt_1433141118424_0012_2_00_000003_1, containerHost=ip-172-31, > localityMatchType=NodeLocal, matchedLocation=ip-172-31-18-41.ec2.internal, > honorLocalityFlags=true, reusedContainer=true, delayedContainers=4, > containerResourceMemory=1536, containerResourceVCores=1 > {code} > Scheduler stops too late. > {code} > 2015-06-01 07:38:27,403 DEBUG [Thread-3] service.AbstractService: Service: > org.apache.tez.dag.app.rm.YarnTaskSchedulerService entered state STOPPED > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)