[ https://issues.apache.org/jira/browse/TEZ-1929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajesh Balamohan updated TEZ-1929:
----------------------------------
    Attachment: applog.txt.gz

Attaching the AM log. It appears that the task for Map 7 was killed twice due to 
internal preemption.

{code}
2015-01-08 00:18:14,508 INFO [Dispatcher thread: Central] 
history.HistoryEventHandler: 
[HISTORY][DAG:dag_1420000126204_0233_1][Event:TASK_ATTEMPT_FINISHED]: 
vertexName=Map 7, taskAttemptId=attempt_1420000126204_0233_1_07_000000_0, 
startTime=1420705088342, finishTime=1420705094506, timeTaken=6164, 
status=FAILED, errorEnum=INTERNAL_PREEMPTION, diagnostics=Container 
container_1420000126204_0233_01_000055 finished with diagnostics set to 
[Container preempted internally], counters=Counters: 1, 
org.apache.tez.common.counters.DAGCounter, RACK_LOCAL_TASKS=1
...
...
2015-01-08 00:19:41,951 INFO [Dispatcher thread: Central] 
history.HistoryEventHandler: 
[HISTORY][DAG:dag_1420000126204_0233_1][Event:TASK_ATTEMPT_FINISHED]: 
vertexName=Map 7, taskAttemptId=attempt_1420000126204_0233_1_07_000000_1, 
startTime=1420705150232, finishTime=1420705181950, timeTaken=31718, 
status=FAILED, errorEnum=INTERNAL_PREEMPTION, diagnostics=Container 
container_1420000126204_0233_01_000107 finished with diagnostics set to 
[Container preempted internally], counters=Counters: 1, 
org.apache.tez.common.counters.DAGCounter, RACK_LOCAL_TASKS=1
{code}
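
For reference, a quick way to pull these events out of the attached log is a small 
throwaway scan like the one below (just a sketch, not anything in Tez; the regex 
simply assumes the key=value layout of the TASK_ATTEMPT_FINISHED lines above):
{code}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Stream;

// Throwaway helper (not part of Tez): lists internally preempted attempts in an AM log.
public class PreemptionScan {
  // Assumes the key=value layout of the TASK_ATTEMPT_FINISHED lines shown above.
  private static final Pattern EVENT = Pattern.compile(
      "TASK_ATTEMPT_FINISHED.*vertexName=([^,]+), taskAttemptId=([^,]+),"
      + ".*errorEnum=INTERNAL_PREEMPTION");

  public static void main(String[] args) throws IOException {
    try (Stream<String> lines = Files.lines(Paths.get(args[0]))) {
      lines.forEach(line -> {
        Matcher m = EVENT.matcher(line);
        if (m.find()) {
          // vertexName -> taskAttemptId for every internally preempted attempt
          System.out.println(m.group(1) + " -> " + m.group(2));
        }
      });
    }
  }
}
{code}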

I didn't see the task getting killed due to memory pressure in the NodeManager logs.

> AM intermittently sending kill signal to running task in heartbeat
> ------------------------------------------------------------------
>
>                 Key: TEZ-1929
>                 URL: https://issues.apache.org/jira/browse/TEZ-1929
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Rajesh Balamohan
>         Attachments: Screen Shot 2015-01-08 at 2.09.11 PM.png, Screen Shot 
> 2015-01-08 at 2.28.04 PM.png, applog.txt.gz, tasklog.txt
>
>
> Observed this behavior 3 or 4 times:
> - Ran a Hive query with Tez (query_17 at 10 TB scale).
> - Occasionally, the Map_7 task gets into a failed state in the middle of 
> fetching data from other sources (only one task is available in Map_7).  
> {code}
> 2015-01-08 00:19:10,289 INFO [Fetcher [Map_6] #0] impl.ShuffleManager: 
> Completed fetch for attempt: InputAttemptIdentifier 
> [inputIdentifier=InputIdentifier [inputIndex=0], attemptNumber=0, 
> pathComponent=attempt_1420000126204_0233_1_06_000000_0_10003] to MEMORY, 
> CompressedSize=6757, DecompressedSize=16490,EndTime=1420705150289, 
> TimeTaken=5, Rate=1.29 MB/s
> 2015-01-08 00:19:10,290 INFO [Fetcher [Map_6] #0] impl.ShuffleManager: All 
> inputs fetched for input vertex : Map 6
> 2015-01-08 00:19:10,290 INFO [Fetcher [Map_6] #0] impl.ShuffleManager: copy(0 
> of 1. Transfer rate (CumulativeDataFetched/TimeSinceInputStarted)) 0.01 MB/s)
> 2015-01-08 00:19:10,290 INFO [ShuffleRunner [Map_6]] impl.ShuffleManager: 
> Shutting down FetchScheduler, Was Interrupted: false
> 2015-01-08 00:19:10,290 INFO [ShuffleRunner [Map_6]] impl.ShuffleManager: 
> Scheduler thread completed
> 2015-01-08 00:19:41,986 INFO [TaskHeartbeatThread] task.TaskReporter: 
> Received should die response from AM
> 2015-01-08 00:19:41,986 INFO [TaskHeartbeatThread] task.TaskReporter: Asked 
> to die via task heartbeat
> 2015-01-08 00:19:41,987 INFO [main] task.TezTaskRunner: Interrupted while 
> waiting for task to complete. Interrupting task
> 2015-01-08 00:19:41,987 INFO [main] task.TezTaskRunner: Shutdown requested... 
> returning
> 2015-01-08 00:19:41,987 INFO [main] task.TezChild: Got a shouldDie 
> notification via hearbeats. Shutting down
> 2015-01-08 00:19:41,990 ERROR [TezChild] tez.TezProcessor: 
> java.lang.InterruptedException
>       at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
>       at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2048)
>       at 
> org.apache.tez.runtime.InputReadyTracker$InputReadyMonitor.awaitCondition(InputReadyTracker.java:120)
>       at 
> org.apache.tez.runtime.InputReadyTracker.waitForAnyInputReady(InputReadyTracker.java:83)
>       at 
> org.apache.tez.runtime.api.impl.TezProcessorContextImpl.waitForAnyInputReady(TezProcessorContextImpl.java:106)
>       at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:153)
>       at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:138)
>       at 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:328)
>       at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:179)
>       at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:171)
> {code}
> From an initial look, it appears that TaskAttemptListenerImpTezDag.heartbeat 
> is unable to identify the containerId from registeredContainers. This needs to 
> be verified.
> I will attach the sample task log and the tez-ui details.
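
To make that suspicion concrete, here is a highly simplified sketch of the 
suspected path (the class, field, and response names are illustrative assumptions, 
not the actual TaskAttemptListenerImpTezDag code): if the heartbeat handler cannot 
find the caller's container in its registered-containers map, it answers with 
shouldDie even though the task is still running.
{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch only -- names and structure are assumptions, not the real
// TaskAttemptListenerImpTezDag. It shows how a missing entry in a
// registered-containers map turns into a "should die" heartbeat response
// for a task that is actually still running.
class HeartbeatPathSketch {
  static final class HeartbeatResponse {
    final boolean shouldDie;
    HeartbeatResponse(boolean shouldDie) { this.shouldDie = shouldDie; }
  }

  // Containers the AM currently believes are running its tasks (assumed structure).
  private final Map<String, Object> registeredContainers = new ConcurrentHashMap<>();

  HeartbeatResponse heartbeat(String containerId) {
    // If the container was deregistered (e.g. around an internal preemption)
    // before the task's next heartbeat arrives, this lookup fails and the task
    // is told to die even though it is mid-shuffle.
    if (!registeredContainers.containsKey(containerId)) {
      return new HeartbeatResponse(true); // -> "Received should die response from AM"
    }
    return new HeartbeatResponse(false);
  }
}
{code}
If that lookup can race with the internal-preemption bookkeeping, it would be 
consistent with the "Received should die response from AM" seen in the task log above.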


