[ https://issues.apache.org/jira/browse/HIVE-15296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sergey Shelukhin updated HIVE-15296:
------------------------------------
    Description: 
First attempt and failure detection:
{noformat}
2016-11-18 20:20:01,980 [INFO] [TaskSchedulerEventHandlerThread] 
|tezplugins.LlapTaskSchedulerService|: Received allocateRequest. 
task=attempt_1478967587833_2622_1_06_000031_0, priority=65, 
capability=<memory:4096, vCores:1>, hosts=[3n01]
2016-11-18 20:20:01,982 [INFO] [LlapScheduler] 
|tezplugins.LlapTaskSchedulerService|: Assigned task 
TaskInfo{task=attempt_1478967587833_2622_1_06_000031_0, priority=65, 
startTime=0, containerId=null, assignedInstance=null, uniqueId=55, 
localityDelayTimeout=9223372036854775807} to container 
container_222212222_2622_01_000056 on node=DynamicServiceInstance [alive=true, 
host=3n01:15001 with resources=<memory:59392, vCores:16>, 
shufflePort=15551, servicesAddress=http://3n01:15002, mgmtPort=15004]
2016-11-18 20:20:01,982 [INFO] [LlapScheduler] 
|tezplugins.LlapTaskSchedulerService|: ScheduleResult for Task: 
TaskInfo{task=attempt_1478967587833_2622_1_06_000031_0, priority=65, 
startTime=10550817928, containerId=container_222212222_2622_01_000056, 
assignedInstance=DynamicServiceInstance [alive=true, host=3n01:15001 with 
resources=<memory:59392, vCores:16>, shufflePort=15551, 
servicesAddress=http://3n01:15002, mgmtPort=15004], uniqueId=55, 
localityDelayTimeout=9223372036854775807} = SCHEDULED
2016-11-18 20:20:03,427 [INFO] [Dispatcher thread {Central}] 
|impl.TaskAttemptImpl|: TaskAttempt: [attempt_1478967587833_2622_1_06_000031_0] 
started. Is using containerId: [container_222212222_2622_01_000056] on NM: 
[3n01:15001]
2016-11-18 20:20:03,427 [INFO] [Dispatcher thread {Central}] 
|history.HistoryEventHandler|: 
[HISTORY][DAG:dag_1478967587833_2622_1][Event:TASK_ATTEMPT_STARTED]: 
vertexName=Map 1, taskAttemptId=attempt_1478967587833_2622_1_06_000031_0, 
startTime=1479500403427, containerId=container_222212222_2622_01_000056, 
nodeId=3n01:15001
2016-11-18 20:20:03,430 [INFO] [TaskCommunicator # 1] 
|tezplugins.LlapTaskCommunicator|: Successfully launched task: 
attempt_1478967587833_2622_1_06_000031_0
2016-11-18 20:20:03,434 [INFO] [IPC Server handler 11 on 43092] 
|impl.TaskImpl|: TaskAttempt:attempt_1478967587833_2622_1_06_000031_0 sent 
events: (0-1).
2016-11-18 20:20:03,434 [INFO] [IPC Server handler 11 on 43092] 
|impl.VertexImpl|: Sending attempt_1478967587833_2622_1_06_000031_0 24 events 
[0,24) total 24 vertex_1478967587833_2622_1_06 [Map 1]
2016-11-18 20:25:43,249 [INFO] [Dispatcher thread {Central}] 
|history.HistoryEventHandler|: 
[HISTORY][DAG:dag_1478967587833_2622_1][Event:TASK_ATTEMPT_FINISHED]: 
vertexName=Map 1, taskAttemptId=attempt_1478967587833_2622_1_06_000031_0, 
creationTime=1479500401929, allocationTime=1479500403426, 
startTime=1479500403427, finishTime=1479500743249, timeTaken=339822, 
status=FAILED, taskFailureType=NON_FATAL, errorEnum=TASK_HEARTBEAT_ERROR, 
diagnostics=AttemptID:attempt_1478967587833_2622_1_06_000031_0 Timed out after 
300 secs, nodeHttpAddress=http://3n01:15002, counters=Counters: 1, 
org.apache.tez.common.counters.DAGCounter, DATA_LOCAL_TASKS=1
2016-11-18 20:25:43,255 [INFO] [TaskSchedulerEventHandlerThread] 
|tezplugins.LlapTaskSchedulerService|: Processing de-allocate request for 
task=attempt_1478967587833_2622_1_06_000031_0, state=ASSIGNED, endReason=OTHER
2016-11-18 20:25:43,259 [INFO] [Dispatcher thread {Central}] |node.AMNodeImpl|: 
Attempt failed on node: 3n01:15001 TA: attempt_1478967587833_2622_1_06_000031_0 
failed: true container: container_222212222_2622_01_000056 numFailedTAs: 7
2016-11-18 20:25:43,262 [INFO] [Dispatcher thread {Central}] |impl.VertexImpl|: 
Source task attempt completed for vertex: vertex_1478967587833_2622_1_07 
[Reducer 2] attempt: attempt_1478967587833_2622_1_06_000031_0 with state: 
FAILED vertexState: RUNNING
{noformat}
Second attempt:
{noformat}
2016-11-18 20:25:43,267 [INFO] [TaskSchedulerEventHandlerThread] 
|tezplugins.LlapTaskSchedulerService|: Received allocateRequest. 
task=attempt_1478967587833_2622_1_06_000031_1, priority=64, 
capability=<memory:4096, vCores:1>, hosts=null
2016-11-18 20:25:43,297 [INFO] [LlapScheduler] 
|tezplugins.LlapTaskSchedulerService|: ScheduleResult for Task: 
TaskInfo{task=attempt_1478967587833_2622_1_06_000031_1, priority=64, 
startTime=0, containerId=null, assignedInstance=null, uniqueId=318, 
localityDelayTimeout=9223372036854775807} = DELAYED_RESOURCES
2016-11-18 20:25:43,297 [INFO] [LlapScheduler] 
|tezplugins.LlapTaskSchedulerService|: Preempting for 
task=attempt_1478967587833_2622_1_06_000031_1 on any available host
2016-11-18 20:25:43,298 [INFO] [LlapScheduler] 
|tezplugins.LlapTaskSchedulerService|: ScheduleResult for Task: 
TaskInfo{task=attempt_1478967587833_2622_1_06_000031_1, priority=64, 
startTime=0, containerId=null, assignedInstance=null, uniqueId=318, 
localityDelayTimeout=9223372036854775807} = DELAYED_RESOURCES
2016-11-18 20:25:43,299 [INFO] [LlapScheduler] 
|tezplugins.LlapTaskSchedulerService|: Assigned task 
TaskInfo{task=attempt_1478967587833_2622_1_06_000031_1, priority=64, 
startTime=0, containerId=null, assignedInstance=null, uniqueId=318, 
localityDelayTimeout=9223372036854775807} to container 
container_222212222_2622_01_000164 on node=DynamicServiceInstance [alive=true, 
host=1n02:15001 with resources=<memory:59392, vCores:16>, 
shufflePort=15551, servicesAddress=http://1n02:15002, mgmtPort=15004]
2016-11-18 20:25:43,300 [INFO] [LlapScheduler] 
|tezplugins.LlapTaskSchedulerService|: ScheduleResult for Task: 
TaskInfo{task=attempt_1478967587833_2622_1_06_000031_1, priority=64, 
startTime=10551159245, containerId=container_222212222_2622_01_000164, 
assignedInstance=DynamicServiceInstance [alive=true, host=1n02:15001 with 
resources=<memory:59392, vCores:16>, shufflePort=15551, 
servicesAddress=http://1n02:15002, mgmtPort=15004], uniqueId=318, 
localityDelayTimeout=9223372036854775807} = SCHEDULED
2016-11-18 20:25:43,301 [INFO] [Dispatcher thread {Central}] 
|impl.TaskAttemptImpl|: TaskAttempt: [attempt_1478967587833_2622_1_06_000031_1] 
started. Is using containerId: [container_222212222_2622_01_000164] on NM: 
[1n02:15001]
2016-11-18 20:25:43,301 [INFO] [Dispatcher thread {Central}] 
|history.HistoryEventHandler|: 
[HISTORY][DAG:dag_1478967587833_2622_1][Event:TASK_ATTEMPT_STARTED]: 
vertexName=Map 1, taskAttemptId=attempt_1478967587833_2622_1_06_000031_1, 
startTime=1479500743301, containerId=container_222212222_2622_01_000164, 
nodeId=1n02:15001
2016-11-18 20:25:43,485 [INFO] [TaskCommunicator # 7] 
|tezplugins.LlapTaskCommunicator|: Successfully launched task: 
attempt_1478967587833_2622_1_06_000031_1
2016-11-18 20:25:43,500 [INFO] [IPC Server handler 5 on 43092] |impl.TaskImpl|: 
TaskAttempt:attempt_1478967587833_2622_1_06_000031_1 sent events: (0-1).
2016-11-18 20:25:43,500 [INFO] [IPC Server handler 5 on 43092] 
|impl.VertexImpl|: Sending attempt_1478967587833_2622_1_06_000031_1 24 events 
[0,24) total 24 vertex_1478967587833_2622_1_06 [Map 1]
{noformat}
The IPC thread logs other activity 250ms later, so it is not stuck, nor did it 
log an error.
After this, there are no further mentions of any task attempt for 
2622_1_06_000031 in the log.
The attempt was not running in LLAP.
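The "no further mentions" check can be done by filtering AM log lines that both mention the task and carry a timestamp after the second launch. A minimal sketch (assuming the standard Tez AM timestamp prefix seen in the logs above; the function name is made up for illustration):

```python
from datetime import datetime

def mentions_after(log_lines, task_fragment, cutoff):
    """Return log lines mentioning task_fragment with a timestamp after cutoff."""
    hits = []
    for line in log_lines:
        try:
            # Tez AM lines start with e.g. "2016-11-18 20:25:43,500" (23 chars).
            ts = datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S,%f")
        except ValueError:
            continue  # wrapped/continuation lines have no timestamp prefix
        if ts > cutoff and task_fragment in line:
            hits.append(line)
    return hits
```

Running this with the fragment `_2622_1_06_000031` and a cutoff of 20:25:43,500 on Nov 18 is what shows the silence described above.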

However, the attempt does come up again when the DAG is killed, a long time 
after it got stuck; presumably, the AM failed to detect the failure (or the 
heartbeat failure, for that matter) and never tried to reschedule.
{noformat}
2016-11-22 08:45:14,228 [INFO] [Dispatcher thread {Central}] 
|history.HistoryEventHandler|: 
[HISTORY][DAG:dag_1478967587833_2622_1][Event:TASK_ATTEMPT_FINISHED]: 
vertexName=Map 1, taskAttemptId=attempt_1478967587833_2622_1_06_000031_1, 
creationTime=1479500743255, allocationTime=1479500743300, 
startTime=1479500743301, finishTime=1479804314228, timeTaken=303570927, 
status=KILLED, errorEnum=TERMINATED_BY_CLIENT, diagnostics=Task KILL is 
received. Killing attempt. Diagnostics: Killing tasks in vertex: 
vertex_1478967587833_2622_1_06 [Map 1] due to trigger: DAG_TERMINATED, 
nodeHttpAddress=http://1n02:15002, counters=Counters: 1, 
org.apache.tez.common.counters.DAGCounter, OTHER_LOCAL_TASKS=1
{noformat}
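What seems to be missing for the second attempt is the same liveness check that caught the first one (TASK_HEARTBEAT_ERROR after 300 secs). As a rough illustration of that safeguard, a minimal sketch, not the actual Tez/LLAP implementation; all names here are hypothetical:

```python
import time

class HeartbeatMonitor:
    """Per-attempt liveness check: any registered attempt that misses
    heartbeats past the timeout is dropped and reported, so the task
    can be marked FAILED and rescheduled instead of hanging forever."""

    def __init__(self, on_attempt_lost, timeout_secs=300.0):
        # 300s matches the timeout visible in the first attempt's logs above.
        self.timeout_secs = timeout_secs
        self.on_attempt_lost = on_attempt_lost
        self._last_heartbeat = {}  # attempt_id -> last heartbeat time

    def register(self, attempt_id, now=None):
        self._last_heartbeat[attempt_id] = time.monotonic() if now is None else now

    def heartbeat(self, attempt_id, now=None):
        if attempt_id in self._last_heartbeat:
            self._last_heartbeat[attempt_id] = time.monotonic() if now is None else now

    def check(self, now=None):
        """Called periodically; returns the attempts declared lost this round."""
        if now is None:
            now = time.monotonic()
        lost = [a for a, t in self._last_heartbeat.items()
                if now - t > self.timeout_secs]
        for attempt_id in lost:
            del self._last_heartbeat[attempt_id]
            self.on_attempt_lost(attempt_id)  # e.g. mark FAILED, reschedule
        return lost
```

The symptom above is consistent with the second attempt never being registered with (or never expiring from) such a monitor, so no failure event fires and no reschedule happens.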

I do see some errors connecting to this node elsewhere (Connection refused), 
but they happen later. Let me know if you need anything else from the AM log.
{noformat}
2016-11-18 20:27:15,902 [WARN] [TaskCommunicator # 8] |ipc.Client|: Failed to 
connect to server: 1n02/(snip).2:15001: retries get failed due to exceeded 
maximum allowed time (in MILLISECONDS): 16000
2016-11-18 20:27:34,517 [WARN] [TaskCommunicator # 3] |ipc.Client|: Failed to 
connect to server: 1n02/(snip).2:15001: retries get failed due to exceeded 
maximum allowed time (in MILLISECONDS): 16000
{noformat}

It seems that a parallel query caused memory exhaustion on LLAP; most mappers 
for this job eventually completed successfully, a few were lost as described 
above, and the DAG got stuck as a result.
Relevant Slider containers and restarts are below (the last one seems to be 
the one where the attempt was running).
Unfortunately, the logs from the relevant LLAP containers are missing from 
aggregation.
{noformat}
2016-11-18 20:20:24,822 [AMRM Callback Handler Thread] ERROR 
appmaster.SliderAppMaster - Role instance RoleInstance{role='LLAP', 
id='container_e55_1478967587833_2465_01_000005', 
container=ContainerID=container_e55_1478967587833_2465_01_000005 
nodeID=3n01:45454  ... diagnostics='Container 
[pid=1636544,containerID=container_e55_1478967587833_2465_01_000005] is running 
beyond physical memory limits. Current usage: 140.3 GB of 140.3 GB physical 
memory used; 141.8 GB of 294.5 GB virtual memory used. Killing container.
2016-11-18 20:20:31,912 [AMRM Callback Handler Thread] ERROR 
appmaster.SliderAppMaster - Role instance RoleInstance{role='LLAP', 
id='container_e55_1478967587833_2465_01_000003', 
container=ContainerID=container_e55_1478967587833_2465_01_000003 
nodeID=1n02:45454  ... diagnostics='Container 
[pid=607173,containerID=container_e55_1478967587833_2465_01_000003] is running 
beyond physical memory limits. Current usage: 141.0 GB of 140.3 GB physical 
memory used; 142.5 GB of 294.5 GB virtual memory used. Killing container.
2016-11-18 20:25:39,248 [AMRM Callback Handler Thread] ERROR 
appmaster.SliderAppMaster - Role instance RoleInstance{role='LLAP', 
id='container_e55_1478967587833_2465_01_000007', 
container=ContainerID=container_e55_1478967587833_2465_01_000007 
nodeID=3n01:45454  ... diagnostics='Container 
[pid=1227721,containerID=container_e55_1478967587833_2465_01_000007] is running 
beyond physical memory limits. Current usage: 140.3 GB of 140.3 GB physical 
memory used; 143.2 GB of 294.5 GB virtual memory used. Killing container.
2016-11-18 20:26:30,250 [AMRM Callback Handler Thread] ERROR 
appmaster.SliderAppMaster - Role instance RoleInstance{role='LLAP', 
id='container_e55_1478967587833_2465_01_000009', 
container=ContainerID=container_e55_1478967587833_2465_01_000009 
nodeID=1n02:45454  ... diagnostics='Container 
[pid=1893586,containerID=container_e55_1478967587833_2465_01_000009] is running 
beyond physical memory limits. Current usage: 141.2 GB of 140.3 GB physical 
memory used; 144.4 GB of 294.5 GB virtual memory used. Killing container.

2016-11-18 20:20:41,060 [145589505@qtp-14633842-4] INFO  
agent.AgentProviderService - Component operation. Status: COMPLETED; new 
container state: HEALTHY; new component state: STARTED
2016-11-18 20:20:49,171 [145589505@qtp-14633842-4] INFO  
agent.AgentProviderService - Component operation. Status: COMPLETED; new 
container state: HEALTHY; new component state: STARTED
2016-11-18 20:25:55,942 [145589505@qtp-14633842-4] INFO  
agent.AgentProviderService - Component operation. Status: COMPLETED; new 
container state: HEALTHY; new component state: STARTED
2016-11-18 20:26:47,019 [145589505@qtp-14633842-4] INFO  
agent.AgentProviderService - Component operation. Status: COMPLETED; new 
container state: HEALTHY; new component state: STARTED
{noformat}


> AM may lose task failures and not reschedule when scheduling to LLAP
> --------------------------------------------------------------------
>
>                 Key: HIVE-15296
>                 URL: https://issues.apache.org/jira/browse/HIVE-15296
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>            Assignee: Siddharth Seth
>
> First attempt and failure detection:
> {noformat}
> 2016-11-18 20:20:01,980 [INFO] [TaskSchedulerEventHandlerThread] 
> |tezplugins.LlapTaskSchedulerService|: Received allocateRequest. 
> task=attempt_1478967587833_2622_1_06_000031_0, priority=65, 
> capability=<memory:4096, vCores:1>, hosts=[3n01]
> 2016-11-18 20:20:01,982 [INFO] [LlapScheduler] 
> |tezplugins.LlapTaskSchedulerService|: Assigned task 
> TaskInfo{task=attempt_1478967587833_2622_1_06_000031_0, priority=65, 
> startTime=0, containerId=null, assignedInstance=null, uniqueId=55, 
> localityDelayTimeout=9223372036854775807} to container 
> container_222212222_2622_01_000056 on node=DynamicServiceInstance 
> [alive=true, host=3n01:15001 with resources=<memory:59392, vCores:16>, 
> shufflePort=15551, servicesAddress=http://3n01:15002, mgmtPort=15004]
> 2016-11-18 20:20:01,982 [INFO] [LlapScheduler] 
> |tezplugins.LlapTaskSchedulerService|: ScheduleResult for Task: 
> TaskInfo{task=attempt_1478967587833_2622_1_06_000031_0, priority=65, 
> startTime=10550817928, containerId=container_222212222_2622_01_000056, 
> assignedInstance=DynamicServiceInstance [alive=true, host=3n01:15001 with 
> resources=<memory:59392, vCores:16>, shufflePort=15551, 
> servicesAddress=http://3n01:15002, mgmtPort=15004], uniqueId=55, 
> localityDelayTimeout=9223372036854775807} = SCHEDULED
> 2016-11-18 20:20:03,427 [INFO] [Dispatcher thread {Central}] 
> |impl.TaskAttemptImpl|: TaskAttempt: 
> [attempt_1478967587833_2622_1_06_000031_0] started. Is using containerId: 
> [container_222212222_2622_01_000056] on NM: [3n01:15001]
> 2016-11-18 20:20:03,427 [INFO] [Dispatcher thread {Central}] 
> |history.HistoryEventHandler|: 
> [HISTORY][DAG:dag_1478967587833_2622_1][Event:TASK_ATTEMPT_STARTED]: 
> vertexName=Map 1, taskAttemptId=attempt_1478967587833_2622_1_06_000031_0, 
> startTime=1479500403427, containerId=container_222212222_2622_01_000056, 
> nodeId=3n01:15001
> 2016-11-18 20:20:03,430 [INFO] [TaskCommunicator # 1] 
> |tezplugins.LlapTaskCommunicator|: Successfully launched task: 
> attempt_1478967587833_2622_1_06_000031_0
> 2016-11-18 20:20:03,434 [INFO] [IPC Server handler 11 on 43092] 
> |impl.TaskImpl|: TaskAttempt:attempt_1478967587833_2622_1_06_000031_0 sent 
> events: (0-1).
> 2016-11-18 20:20:03,434 [INFO] [IPC Server handler 11 on 43092] 
> |impl.VertexImpl|: Sending attempt_1478967587833_2622_1_06_000031_0 24 events 
> [0,24) total 24 vertex_1478967587833_2622_1_06 [Map 1]
> 2016-11-18 20:25:43,249 [INFO] [Dispatcher thread {Central}] 
> |history.HistoryEventHandler|: 
> [HISTORY][DAG:dag_1478967587833_2622_1][Event:TASK_ATTEMPT_FINISHED]: 
> vertexName=Map 1, taskAttemptId=attempt_1478967587833_2622_1_06_000031_0, 
> creationTime=1479500401929, allocationTime=1479500403426, 
> startTime=1479500403427, finishTime=1479500743249, timeTaken=339822, 
> status=FAILED, taskFailureType=NON_FATAL, errorEnum=TASK_HEARTBEAT_ERROR, 
> diagnostics=AttemptID:attempt_1478967587833_2622_1_06_000031_0 Timed out 
> after 300 secs, nodeHttpAddress=http://3n01:15002, counters=Counters: 1, 
> org.apache.tez.common.counters.DAGCounter, DATA_LOCAL_TASKS=1
> 2016-11-18 20:25:43,255 [INFO] [TaskSchedulerEventHandlerThread] 
> |tezplugins.LlapTaskSchedulerService|: Processing de-allocate request for 
> task=attempt_1478967587833_2622_1_06_000031_0, state=ASSIGNED, endReason=OTHER
> 2016-11-18 20:25:43,259 [INFO] [Dispatcher thread {Central}] 
> |node.AMNodeImpl|: Attempt failed on node: 3n01:15001 TA: 
> attempt_1478967587833_2622_1_06_000031_0 failed: true container: 
> container_222212222_2622_01_000056 numFailedTAs: 7
> 2016-11-18 20:25:43,262 [INFO] [Dispatcher thread {Central}] 
> |impl.VertexImpl|: Source task attempt completed for vertex: 
> vertex_1478967587833_2622_1_07 [Reducer 2] attempt: 
> attempt_1478967587833_2622_1_06_000031_0 with state: FAILED vertexState: 
> RUNNING
> {noformat}
> Second attempt:
> {noformat}
> 2016-11-18 20:25:43,267 [INFO] [TaskSchedulerEventHandlerThread] 
> |tezplugins.LlapTaskSchedulerService|: Received allocateRequest. 
> task=attempt_1478967587833_2622_1_06_000031_1, priority=64, 
> capability=<memory:4096, vCores:1>, hosts=null
> 2016-11-18 20:25:43,297 [INFO] [LlapScheduler] 
> |tezplugins.LlapTaskSchedulerService|: ScheduleResult for Task: 
> TaskInfo{task=attempt_1478967587833_2622_1_06_000031_1, priority=64, 
> startTime=0, containerId=null, assignedInstance=null, uniqueId=318, 
> localityDelayTimeout=9223372036854775807} = DELAYED_RESOURCES
> 2016-11-18 20:25:43,297 [INFO] [LlapScheduler] 
> |tezplugins.LlapTaskSchedulerService|: Preempting for 
> task=attempt_1478967587833_2622_1_06_000031_1 on any available host
> 2016-11-18 20:25:43,298 [INFO] [LlapScheduler] 
> |tezplugins.LlapTaskSchedulerService|: ScheduleResult for Task: 
> TaskInfo{task=attempt_1478967587833_2622_1_06_000031_1, priority=64, 
> startTime=0, containerId=null, assignedInstance=null, uniqueId=318, 
> localityDelayTimeout=9223372036854775807} = DELAYED_RESOURCES
> 2016-11-18 20:25:43,299 [INFO] [LlapScheduler] 
> |tezplugins.LlapTaskSchedulerService|: Assigned task 
> TaskInfo{task=attempt_1478967587833_2622_1_06_000031_1, priority=64, 
> startTime=0, containerId=null, assignedInstance=null, uniqueId=318, 
> localityDelayTimeout=9223372036854775807} to container 
> container_222212222_2622_01_000164 on node=DynamicServiceInstance 
> [alive=true, host=1n02:15001 with resources=<memory:59392, vCores:16>, 
> shufflePort=15551, servicesAddress=http://1n02:15002, mgmtPort=15004]
> 2016-11-18 20:25:43,300 [INFO] [LlapScheduler] 
> |tezplugins.LlapTaskSchedulerService|: ScheduleResult for Task: 
> TaskInfo{task=attempt_1478967587833_2622_1_06_000031_1, priority=64, 
> startTime=10551159245, containerId=container_222212222_2622_01_000164, 
> assignedInstance=DynamicServiceInstance [alive=true, host=1n02:15001 with 
> resources=<memory:59392, vCores:16>, shufflePort=15551, 
> servicesAddress=http://1n02:15002, mgmtPort=15004], uniqueId=318, 
> localityDelayTimeout=9223372036854775807} = SCHEDULED
> 2016-11-18 20:25:43,301 [INFO] [Dispatcher thread {Central}] 
> |impl.TaskAttemptImpl|: TaskAttempt: 
> [attempt_1478967587833_2622_1_06_000031_1] started. Is using containerId: 
> [container_222212222_2622_01_000164] on NM: [1n02:15001]
> 2016-11-18 20:25:43,301 [INFO] [Dispatcher thread {Central}] 
> |history.HistoryEventHandler|: 
> [HISTORY][DAG:dag_1478967587833_2622_1][Event:TASK_ATTEMPT_STARTED]: 
> vertexName=Map 1, taskAttemptId=attempt_1478967587833_2622_1_06_000031_1, 
> startTime=1479500743301, containerId=container_222212222_2622_01_000164, 
> nodeId=1n02:15001
> 2016-11-18 20:25:43,485 [INFO] [TaskCommunicator # 7] 
> |tezplugins.LlapTaskCommunicator|: Successfully launched task: 
> attempt_1478967587833_2622_1_06_000031_1
> 2016-11-18 20:25:43,500 [INFO] [IPC Server handler 5 on 43092] 
> |impl.TaskImpl|: TaskAttempt:attempt_1478967587833_2622_1_06_000031_1 sent 
> events: (0-1).
> 2016-11-18 20:25:43,500 [INFO] [IPC Server handler 5 on 43092] 
> |impl.VertexImpl|: Sending attempt_1478967587833_2622_1_06_000031_1 24 events 
> [0,24) total 24 vertex_1478967587833_2622_1_06 [Map 1]
> {noformat}
> The IPC thread logs some other activity 250ms later, so it's not stuck nor 
> did it log an error.
> After this, there are no mentions of any tasks for 2622_1_06_000031 in the 
> log.
> The attempt was not running in LLAP.
> However, the attempt does come up again when the DAG is killed, long time 
> after it was stuck; presumably, AM failed to detect the failure (or heartbeat 
> failure for that matter?) and never tried to reschedule.
> {noformat}
> 2016-11-22 08:45:14,228 [INFO] [Dispatcher thread {Central}] 
> |history.HistoryEventHandler|: 
> [HISTORY][DAG:dag_1478967587833_2622_1][Event:TASK_ATTEMPT_FINISHED]: 
> vertexName=Map 1, taskAttemptId=attempt_1478967587833_2622_1_06_000031_1, 
> creationTime=1479500743255, allocationTime=1479500743300, 
> startTime=1479500743301, finishTime=1479804314228, timeTaken=303570927, 
> status=KILLED, errorEnum=TERMINATED_BY_CLIENT, diagnostics=Task KILL is 
> received. Killing attempt. Diagnostics: Killing tasks in vertex: 
> vertex_1478967587833_2622_1_06 [Map 1] due to trigger: DAG_TERMINATED, 
> nodeHttpAddress=http://1n02:15002, counters=Counters: 1, 
> org.apache.tez.common.counters.DAGCounter, OTHER_LOCAL_TASKS=1
> {noformat}
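To illustrate the missing piece described above, here is a minimal sketch, assuming a plain per-attempt timestamp map, of the kind of liveness check the AM could run so that a silent attempt is declared lost and rescheduled instead of hanging until the DAG is killed. This is hypothetical illustration code, not actual Tez or Hive code; the class and method names are invented for this sketch.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch (not actual Tez code): track the last time each
// attempt was heard from, and flag attempts that have gone silent for
// longer than a timeout so the scheduler can retry them.
public class AttemptWatchdog {
    private final long timeoutMs;
    private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();

    public AttemptWatchdog(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    // Called whenever the task communicator hears from the attempt.
    public void recordHeartbeat(String attemptId, long nowMs) {
        lastHeartbeat.put(attemptId, nowMs);
    }

    // True if the attempt is known but has been silent longer than the timeout.
    public boolean isLost(String attemptId, long nowMs) {
        Long last = lastHeartbeat.get(attemptId);
        return last != null && nowMs - last > timeoutMs;
    }
}
```

A periodic sweep over `isLost` would have caught the attempt above, which went silent for days rather than minutes.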
> I do see some connection errors to this node elsewhere (with "Connection 
> refused"), but they happen later. Let me know if you need anything else from 
> the AM log.
> {noformat}
> 2016-11-18 20:27:15,902 [WARN] [TaskCommunicator # 8] |ipc.Client|: Failed to 
> connect to server: 1n02/(snip).2:15001: retries get failed due to exceeded 
> maximum allowed time (in MILLISECONDS): 16000
> 2016-11-18 20:27:34,517 [WARN] [TaskCommunicator # 3] |ipc.Client|: Failed to 
> connect to server: 1n02/(snip).2:15001: retries get failed due to exceeded 
> maximum allowed time (in MILLISECONDS): 16000
> {noformat}
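The "exceeded maximum allowed time (in MILLISECONDS): 16000" warnings above come from a time-bounded connect-retry policy. A minimal sketch of that behavior, under the assumption of a fixed sleep between attempts and negligible connect time (the method name and values are illustrative, not Hadoop's actual code):

```java
// Illustrative time-bounded retry loop, similar in spirit to the
// ipc.Client behavior logged above: keep attempting to connect until
// the total elapsed time exceeds a fixed budget, then give up.
public class TimedRetry {
    // Number of connect attempts made before the time budget is exceeded.
    public static int attemptsWithinBudget(long budgetMs, long sleepMs) {
        int attempts = 0;
        long elapsed = 0;
        while (elapsed <= budgetMs) {
            attempts++;          // one connect attempt
            elapsed += sleepMs;  // fixed sleep before the next retry
        }
        return attempts;
    }
}
```

With a 16 s budget and 1 s sleeps, the client gives up after 17 attempts; the key point for this bug is that these retries happened only minutes after the attempt launch, well before the DAG kill days later.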
> It seems like a parallel query caused the memory exhaustion on LLAP; most 
> mappers for this job eventually completed successfully, a few were lost as 
> described above, and the DAG got stuck as a result.
> Relevant Slider container kills and restarts are below (the last one seems 
> to be the one where the attempt was running). Unfortunately, the logs from 
> the relevant LLAP containers are missing from aggregation.
> {noformat}
> 2016-11-18 20:20:24,822 [AMRM Callback Handler Thread] ERROR 
> appmaster.SliderAppMaster - Role instance RoleInstance{role='LLAP', 
> id='container_e55_1478967587833_2465_01_000005', 
> container=ContainerID=container_e55_1478967587833_2465_01_000005 
> nodeID=3n01:45454  ... diagnostics='Container 
> [pid=1636544,containerID=container_e55_1478967587833_2465_01_000005] is 
> running beyond physical memory limits. Current usage: 140.3 GB of 140.3 GB 
> physical memory used; 141.8 GB of 294.5 GB virtual memory used. Killing 
> container.
> 2016-11-18 20:20:31,912 [AMRM Callback Handler Thread] ERROR 
> appmaster.SliderAppMaster - Role instance RoleInstance{role='LLAP', 
> id='container_e55_1478967587833_2465_01_000003', 
> container=ContainerID=container_e55_1478967587833_2465_01_000003 
> nodeID=1n02:45454  ... diagnostics='Container 
> [pid=607173,containerID=container_e55_1478967587833_2465_01_000003] is 
> running beyond physical memory limits. Current usage: 141.0 GB of 140.3 GB 
> physical memory used; 142.5 GB of 294.5 GB virtual memory used. Killing 
> container.
> 2016-11-18 20:25:39,248 [AMRM Callback Handler Thread] ERROR 
> appmaster.SliderAppMaster - Role instance RoleInstance{role='LLAP', 
> id='container_e55_1478967587833_2465_01_000007', 
> container=ContainerID=container_e55_1478967587833_2465_01_000007 
> nodeID=3n01:45454  ... diagnostics='Container 
> [pid=1227721,containerID=container_e55_1478967587833_2465_01_000007] is 
> running beyond physical memory limits. Current usage: 140.3 GB of 140.3 GB 
> physical memory used; 143.2 GB of 294.5 GB virtual memory used. Killing 
> container.
> 2016-11-18 20:26:30,250 [AMRM Callback Handler Thread] ERROR 
> appmaster.SliderAppMaster - Role instance RoleInstance{role='LLAP', 
> id='container_e55_1478967587833_2465_01_000009', 
> container=ContainerID=container_e55_1478967587833_2465_01_000009 
> nodeID=1n02:45454  ... diagnostics='Container 
> [pid=1893586,containerID=container_e55_1478967587833_2465_01_000009] is 
> running beyond physical memory limits. Current usage: 141.2 GB of 140.3 GB 
> physical memory used; 144.4 GB of 294.5 GB virtual memory used. Killing 
> container.
> 2016-11-18 20:20:41,060 [145589505@qtp-14633842-4] INFO  
> agent.AgentProviderService - Component operation. Status: COMPLETED; new 
> container state: HEALTHY; new component state: STARTED
> 2016-11-18 20:20:49,171 [145589505@qtp-14633842-4] INFO  
> agent.AgentProviderService - Component operation. Status: COMPLETED; new 
> container state: HEALTHY; new component state: STARTED
> 2016-11-18 20:25:55,942 [145589505@qtp-14633842-4] INFO  
> agent.AgentProviderService - Component operation. Status: COMPLETED; new 
> container state: HEALTHY; new component state: STARTED
> 2016-11-18 20:26:47,019 [145589505@qtp-14633842-4] INFO  
> agent.AgentProviderService - Component operation. Status: COMPLETED; new 
> container state: HEALTHY; new component state: STARTED
> {noformat}
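The "140.3 GB of 140.3 GB physical memory used" kills above are the NodeManager's pmem enforcement: a daemon whose heap plus cache plus overhead approaches the container limit gets killed. A sketch of the sizing arithmetic only (the 5% safety margin is an assumed value, not a Hive default):

```java
// Illustrative arithmetic: check that the LLAP daemon's total memory
// footprint fits under the YARN container's physical-memory limit with
// some safety margin, so the NM does not kill it as in the logs above.
public class ContainerHeadroom {
    public static boolean fits(long daemonMemMb, long containerLimitMb,
                               double safetyMargin) {
        return daemonMemMb <= (long) (containerLimitMb * (1.0 - safetyMargin));
    }
}
```

In this incident the daemons were sized right at the limit (140.3 GB of 140.3 GB), so any extra pressure from a parallel query pushed them over.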



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
