[ 
https://issues.apache.org/jira/browse/TEZ-1929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296540#comment-14296540
 ] 

Bikas Saha commented on TEZ-1929:
---------------------------------

Will rename.

Right. Its not part of this patch. These are a bunch of well defined enums for 
error causes. We could separately just rename the enum to ErrorCause instead of 
TATerminationCause or ContainerTerminationCause. Prevents proliferation of 
identical enums for dag, vertex, task. container etc.

Will update the patch with the rename.

> AM intermittently sending kill signal to running task in heartbeat
> ------------------------------------------------------------------
>
>                 Key: TEZ-1929
>                 URL: https://issues.apache.org/jira/browse/TEZ-1929
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.6.0
>            Reporter: Rajesh Balamohan
>            Assignee: Bikas Saha
>         Attachments: Screen Shot 2015-01-08 at 2.09.11 PM.png, Screen Shot 
> 2015-01-08 at 2.28.04 PM.png, TEZ-1929.1.patch, TEZ-1929.2.patch, 
> applog.txt.gz, tasklog.txt
>
>
> Observed this behavior 3 or 4 times
> - Ran a hive query with tez (query_17 at 10 TB scale)
> - Occasionally, Map_7 task will get into failed state in the middle of 
> fetching data from other sources (only one task is available in Map_7).  
> {code}
> 2015-01-08 00:19:10,289 INFO [Fetcher [Map_6] #0] impl.ShuffleManager: 
> Completed fetch for attempt: InputAttemptIdentifier 
> [inputIdentifier=InputIdentifier [inputIndex=0], attemptNumber=0, 
> pathComponent=attempt_1420000126204_0233_1_06_000000_0_10003] to MEMORY, 
> CompressedSize=6757, DecompressedSize=16490,EndTime=1420705150289, 
> TimeTaken=5, Rate=1.29 MB/s
> 2015-01-08 00:19:10,290 INFO [Fetcher [Map_6] #0] impl.ShuffleManager: All 
> inputs fetched for input vertex : Map 6
> 2015-01-08 00:19:10,290 INFO [Fetcher [Map_6] #0] impl.ShuffleManager: copy(0 
> of 1. Transfer rate (CumulativeDataFetched/TimeSinceInputStarted)) 0.01 MB/s)
> 2015-01-08 00:19:10,290 INFO [ShuffleRunner [Map_6]] impl.ShuffleManager: 
> Shutting down FetchScheduler, Was Interrupted: false
> 2015-01-08 00:19:10,290 INFO [ShuffleRunner [Map_6]] impl.ShuffleManager: 
> Scheduler thread completed
> 2015-01-08 00:19:41,986 INFO [TaskHeartbeatThread] task.TaskReporter: 
> Received should die response from AM
> 2015-01-08 00:19:41,986 INFO [TaskHeartbeatThread] task.TaskReporter: Asked 
> to die via task heartbeat
> 2015-01-08 00:19:41,987 INFO [main] task.TezTaskRunner: Interrupted while 
> waiting for task to complete. Interrupting task
> 2015-01-08 00:19:41,987 INFO [main] task.TezTaskRunner: Shutdown requested... 
> returning
> 2015-01-08 00:19:41,987 INFO [main] task.TezChild: Got a shouldDie 
> notification via hearbeats. Shutting down
> 2015-01-08 00:19:41,990 ERROR [TezChild] tez.TezProcessor: 
> java.lang.InterruptedException
>       at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
>       at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2048)
>       at 
> org.apache.tez.runtime.InputReadyTracker$InputReadyMonitor.awaitCondition(InputReadyTracker.java:120)
>       at 
> org.apache.tez.runtime.InputReadyTracker.waitForAnyInputReady(InputReadyTracker.java:83)
>       at 
> org.apache.tez.runtime.api.impl.TezProcessorContextImpl.waitForAnyInputReady(TezProcessorContextImpl.java:106)
>       at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:153)
>       at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:138)
>       at 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:328)
>       at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:179)
>       at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:171)
> {code}
> From the initial look, it appears that TaskAttemptListenerImpTezDag.heartbeat 
> is unable to identify the containerId from registeredContainers.  Need to 
> verify this.
> I will attach the sample task log and the tez-ui details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to