Rajesh Balamohan created TEZ-1929:
-------------------------------------

             Summary: AM intermittently sending kill signal to running task in heartbeat
                 Key: TEZ-1929
                 URL: https://issues.apache.org/jira/browse/TEZ-1929
             Project: Apache Tez
          Issue Type: Bug
            Reporter: Rajesh Balamohan


Observed this behavior 3 or 4 times

- Ran a Hive query on Tez (query_17 at 10 TB scale).
- Occasionally, the Map_7 task goes into a failed state while it is still
fetching data from other sources (Map_7 has only one task).

{code}
2015-01-08 00:19:10,289 INFO [Fetcher [Map_6] #0] impl.ShuffleManager: Completed fetch for attempt: InputAttemptIdentifier [inputIdentifier=InputIdentifier [inputIndex=0], attemptNumber=0, pathComponent=attempt_1420000126204_0233_1_06_000000_0_10003] to MEMORY, CompressedSize=6757, DecompressedSize=16490,EndTime=1420705150289, TimeTaken=5, Rate=1.29 MB/s
2015-01-08 00:19:10,290 INFO [Fetcher [Map_6] #0] impl.ShuffleManager: All inputs fetched for input vertex : Map 6
2015-01-08 00:19:10,290 INFO [Fetcher [Map_6] #0] impl.ShuffleManager: copy(0 of 1. Transfer rate (CumulativeDataFetched/TimeSinceInputStarted)) 0.01 MB/s)
2015-01-08 00:19:10,290 INFO [ShuffleRunner [Map_6]] impl.ShuffleManager: Shutting down FetchScheduler, Was Interrupted: false
2015-01-08 00:19:10,290 INFO [ShuffleRunner [Map_6]] impl.ShuffleManager: Scheduler thread completed
2015-01-08 00:19:41,986 INFO [TaskHeartbeatThread] task.TaskReporter: Received should die response from AM
2015-01-08 00:19:41,986 INFO [TaskHeartbeatThread] task.TaskReporter: Asked to die via task heartbeat
2015-01-08 00:19:41,987 INFO [main] task.TezTaskRunner: Interrupted while waiting for task to complete. Interrupting task
2015-01-08 00:19:41,987 INFO [main] task.TezTaskRunner: Shutdown requested... returning
2015-01-08 00:19:41,987 INFO [main] task.TezChild: Got a shouldDie notification via hearbeats. Shutting down
2015-01-08 00:19:41,990 ERROR [TezChild] tez.TezProcessor: 
java.lang.InterruptedException
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2048)
        at org.apache.tez.runtime.InputReadyTracker$InputReadyMonitor.awaitCondition(InputReadyTracker.java:120)
        at org.apache.tez.runtime.InputReadyTracker.waitForAnyInputReady(InputReadyTracker.java:83)
        at org.apache.tez.runtime.api.impl.TezProcessorContextImpl.waitForAnyInputReady(TezProcessorContextImpl.java:106)
        at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:153)
        at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:138)
        at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:328)
        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:179)
        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:171)
{code}

From an initial look, it appears that TaskAttemptListenerImpTezDag.heartbeat
is unable to identify the containerId from registeredContainers. This still
needs to be verified.
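
To make the suspected failure mode concrete, here is a minimal sketch of how a
heartbeat handler could end up answering "should die" when the calling
container is not found in its registry. This is not the Tez source: the class
HeartbeatSketch, its Response type, and the register/unregister methods are
hypothetical names used only for illustration, and the map is an assumed
stand-in for registeredContainers.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical, simplified model of an AM-side heartbeat handler.
// It is NOT the Tez implementation; all names here are illustrative.
class HeartbeatSketch {

    /** Minimal response carrying only the flag that shows up in the task log. */
    static final class Response {
        final boolean shouldDie;

        Response(boolean shouldDie) {
            this.shouldDie = shouldDie;
        }
    }

    // Assumed stand-in for registeredContainers: container ID -> task attempt ID.
    private final Map<String, String> registeredContainers = new ConcurrentHashMap<>();

    void register(String containerId, String attemptId) {
        registeredContainers.put(containerId, attemptId);
    }

    void unregister(String containerId) {
        registeredContainers.remove(containerId);
    }

    Response heartbeat(String containerId) {
        // If the container is unknown, the handler cannot map the caller to a
        // task attempt, so in this sketch it tells the caller to shut down.
        boolean known = registeredContainers.containsKey(containerId);
        return new Response(!known);
    }
}
```

In this model, a heartbeat that arrives after the container has been dropped
from the map (or before it was added) gets shouldDie=true even though the task
itself is healthy, which would match the "Received should die response from AM"
line in the log above.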

I will attach the sample task log and the tez-ui details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
