[ 
https://issues.apache.org/jira/browse/TEZ-1790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14219084#comment-14219084
 ] 

Siddharth Seth commented on TEZ-1790:
-------------------------------------

[~jeffzhang], the patch mostly looks good to me. De-allocations are scheduled 
at a higher priority so that available containers are freed up for pending 
allocations. In a kill scenario, as you said, both requests can end up in the 
queue at the same time, so the de-allocate is handled first, which would cause 
this.

Minor: once an associated allocate request is found, can we break out of the 
loop?
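
To make the ordering problem and the early exit concrete, here is a minimal sketch. It is not the attached patch: the class, field, and method names (LocalSchedulerSketch, Request, handleDeallocate) are hypothetical, and a PriorityBlockingQueue is assumed only to illustrate de-allocations sorting ahead of allocations.

```java
import java.util.concurrent.PriorityBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for a local-mode scheduler; not the Tez implementation.
class LocalSchedulerSketch {

    // Hypothetical request type: de-allocations sort ahead of allocations,
    // and requests of the same kind keep their submission order.
    static final class Request implements Comparable<Request> {
        private static final AtomicLong SEQ = new AtomicLong();
        final Object task;
        final boolean deallocate;
        final long seq = SEQ.getAndIncrement();

        Request(Object task, boolean deallocate) {
            this.task = task;
            this.deallocate = deallocate;
        }

        @Override
        public int compareTo(Request other) {
            if (deallocate != other.deallocate) {
                return deallocate ? -1 : 1;
            }
            return Long.compare(seq, other.seq);
        }
    }

    final PriorityBlockingQueue<Request> queue = new PriorityBlockingQueue<>();

    void allocate(Object task) {
        queue.add(new Request(task, false));
    }

    void deallocate(Object task) {
        queue.add(new Request(task, true));
    }

    // Cancels the still-queued allocation for the same task, if there is one.
    // Returns true when a pending allocation was found and removed.
    boolean handleDeallocate(Request dealloc) {
        boolean cancelled = false;
        // The iterator works on a snapshot, so removing from the queue is safe here.
        for (Request pending : queue) {
            if (!pending.deallocate && pending.task.equals(dealloc.task)) {
                queue.remove(pending);
                cancelled = true;
                break; // at most one matching allocation is assumed, so stop here
            }
        }
        return cancelled;
    }

    public static void main(String[] args) {
        LocalSchedulerSketch scheduler = new LocalSchedulerSketch();
        scheduler.allocate("task-1");
        scheduler.deallocate("task-1");

        // The de-allocation is polled first even though it was queued second.
        Request head = scheduler.queue.poll();
        System.out.println("head is deallocate: " + head.deallocate);
        System.out.println("pending allocation cancelled: " + scheduler.handleDeallocate(head));
        System.out.println("queue empty: " + scheduler.queue.isEmpty());
    }
}
```

Without the removal step, the allocation for the killed task would stay in the queue and later launch a task attempt, which is the situation described in the issue.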

> DeallocationTaskRequest may be handled before corresponding 
> AllocationTaskRequest in local mode
> -------------------------------------------------------------------------------------------------
>
>                 Key: TEZ-1790
>                 URL: https://issues.apache.org/jira/browse/TEZ-1790
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Jeff Zhang
>            Assignee: Jeff Zhang
>         Attachments: TEZ-1790.patch
>
>
> In Tez local mode, when a DAG is killed, a DeallocationTaskRequest may be 
> handled before the corresponding AllocationTaskRequest. In that case the 
> TaskRequest is not really deallocated, because the AllocationTaskRequest is 
> handled after the DeallocationTaskRequest. In local session mode, the DAG is 
> killed but its TaskRequest is still there and will go on to launch the task 
> attempt. The task attempt then starts heartbeating with the AM, while the AM 
> has already started a new DAG. This causes the following exception (the task 
> attempt is heartbeating with the wrong DAG, because its own DAG has been 
> killed):
> {code}
> 15:38:24,208 - Thread(TaskHeartbeatThread) - (TezTaskRunner.java:333) - 
> TaskReporter reported error
> java.lang.NullPointerException
>       at 
> org.apache.tez.dag.app.TaskAttemptListenerImpTezDag.heartbeat(TaskAttemptListenerImpTezDag.java:514)
>       at 
> org.apache.tez.runtime.task.TaskReporter$HeartbeatCallable.heartbeat(TaskReporter.java:249)
>       at 
> org.apache.tez.runtime.task.TaskReporter$HeartbeatCallable.call(TaskReporter.java:176)
>       at 
> org.apache.tez.runtime.task.TaskReporter$HeartbeatCallable.call(TaskReporter.java:118)
>       at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>       at java.lang.Thread.run(Thread.java:745)
> {code}
> This error causes the TezChild to be interrupted:
> {code}
> 16:04:26,718 - Thread(TezChild) - (TezTaskRunner.java:221) - Encounted an 
> error while executing task: attempt_1416384252992_0001_2_00_000000_0
> java.lang.InterruptedException
>       at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219)
>       at 
> java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
>       at 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:439)
>       at 
> java.util.concurrent.ExecutorCompletionService.take(ExecutorCompletionService.java:193)
>       at 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.initialize(LogicalIOProcessorRuntimeTask.java:211)
>       at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:173)
>       at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:168)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:415)
>       at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>       at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.call(TezTaskRunner.java:168)
>       at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.call(TezTaskRunner.java:163)
>       at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>       at java.lang.Thread.run(Thread.java:745)
> {code}
> This issue sometimes causes TestExceptionPropagation to time out, especially 
> on Windows.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
