[ https://issues.apache.org/jira/browse/TEZ-1790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14219084#comment-14219084 ]
Siddharth Seth commented on TEZ-1790:
-------------------------------------

[~jeffzhang], the patch mostly looks good to me.

Deallocations are scheduled at a higher priority so that the available containers are freed up for pending allocations. In a kill scenario, like you said, it's possible for both requests to end up in the queue, which would cause this.

Minor: once an associated allocate request is found, can we break out of the loop?

> DeallocationTaskRequest may been handled before corresponding AllocationTaskRequest in local mode
> -------------------------------------------------------------------------------------------------
>
>                 Key: TEZ-1790
>                 URL: https://issues.apache.org/jira/browse/TEZ-1790
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Jeff Zhang
>            Assignee: Jeff Zhang
>         Attachments: TEZ-1790.patch
>
>
> In Tez local mode, when a DAG is killed, a DeallocationTaskRequest may be
> handled before the corresponding AllocationTaskRequest. In that case, the
> TaskRequest is not really deallocated, because the AllocationTaskRequest is
> handled after the DeallocationTaskRequest. In local session mode, the DAG is
> killed but its TaskRequest is still there and goes on to launch the task
> attempt. The task attempt then starts heartbeating with the AM, while the AM
> has already started a new DAG. This causes the following exception.
> (The task attempt is heartbeating with the wrong DAG, because its own DAG
> has been killed.)
> {code}
> 15:38:24,208 - Thread(TaskHeartbeatThread) - (TezTaskRunner.java:333) - TaskReporter reported error
> java.lang.NullPointerException
>     at org.apache.tez.dag.app.TaskAttemptListenerImpTezDag.heartbeat(TaskAttemptListenerImpTezDag.java:514)
>     at org.apache.tez.runtime.task.TaskReporter$HeartbeatCallable.heartbeat(TaskReporter.java:249)
>     at org.apache.tez.runtime.task.TaskReporter$HeartbeatCallable.call(TaskReporter.java:176)
>     at org.apache.tez.runtime.task.TaskReporter$HeartbeatCallable.call(TaskReporter.java:118)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:745)
> {code}
> This error causes the TezChild to be interrupted:
> {code}
> 16:04:26,718 - Thread(TezChild) - (TezTaskRunner.java:221) - Encounted an error while executing task: attempt_1416384252992_0001_2_00_000000_0
> java.lang.InterruptedException
>     at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219)
>     at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
>     at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:439)
>     at java.util.concurrent.ExecutorCompletionService.take(ExecutorCompletionService.java:193)
>     at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.initialize(LogicalIOProcessorRuntimeTask.java:211)
>     at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:173)
>     at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:168)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:415)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>     at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.call(TezTaskRunner.java:168)
>     at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.call(TezTaskRunner.java:163)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:745)
> {code}
> This issue sometimes causes TestExceptionPropagation to time out,
> especially on Windows.
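
For illustration only, the ordering problem and the suggested early exit can be sketched as below. This is not the code in TEZ-1790.patch: the class `LocalSchedulerSketch`, its `Request` type, and all method names are hypothetical, and the sketch assumes at most one pending allocate request per task. It shows a priority queue in which deallocate requests sort ahead of allocate requests, and a deallocate handler that removes the still-pending allocate request and then breaks out of the scan.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.PriorityBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch; names do not correspond to the real Tez local-mode scheduler.
class LocalSchedulerSketch {

    // Declaration order matters: DEALLOCATE has the lower ordinal, so it sorts first.
    enum Kind { DEALLOCATE, ALLOCATE }

    static final class Request implements Comparable<Request> {
        private static final AtomicLong SEQ = new AtomicLong();
        final Kind kind;
        final String task;
        final long seq = SEQ.getAndIncrement(); // FIFO tie-breaker within one kind

        Request(Kind kind, String task) {
            this.kind = kind;
            this.task = task;
        }

        @Override
        public int compareTo(Request other) {
            int byKind = kind.compareTo(other.kind);
            return byKind != 0 ? byKind : Long.compare(seq, other.seq);
        }
    }

    private final PriorityBlockingQueue<Request> queue = new PriorityBlockingQueue<>();
    private final List<String> running = new ArrayList<>();

    void allocate(String task) {
        queue.add(new Request(Kind.ALLOCATE, task));
    }

    void deallocate(String task) {
        queue.add(new Request(Kind.DEALLOCATE, task));
    }

    /** Processes every queued request in priority order. */
    void drain() {
        Request r;
        while ((r = queue.poll()) != null) {
            if (r.kind == Kind.ALLOCATE) {
                running.add(r.task); // stands in for launching the task attempt
            } else {
                handleDeallocate(r.task);
            }
        }
    }

    private void handleDeallocate(String task) {
        if (running.remove(task)) {
            return; // the task had already been launched and is now released
        }
        // Not launched yet: its allocate request may still be queued behind us.
        Iterator<Request> it = queue.iterator();
        while (it.hasNext()) {
            Request pending = it.next();
            if (pending.kind == Kind.ALLOCATE && pending.task.equals(task)) {
                it.remove();
                break; // assumption: one allocate per task, so stop scanning
            }
        }
    }

    List<String> running() {
        return running;
    }
}
```

Without the scan in `handleDeallocate`, a deallocate that overtakes its allocate would be a no-op, and the allocate would later launch a task attempt for a DAG that has already been killed, which is the situation described in the issue.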