[ https://issues.apache.org/jira/browse/HIVE-15296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sergey Shelukhin updated HIVE-15296:
------------------------------------
    Description:
First attempt and failure detection:
{noformat}
2016-11-18 20:20:01,980 [INFO] [TaskSchedulerEventHandlerThread] |tezplugins.LlapTaskSchedulerService|: Received allocateRequest. task=attempt_1478967587833_2622_1_06_000031_0, priority=65, capability=<memory:4096, vCores:1>, hosts=[3n01]
2016-11-18 20:20:01,982 [INFO] [LlapScheduler] |tezplugins.LlapTaskSchedulerService|: Assigned task TaskInfo{task=attempt_1478967587833_2622_1_06_000031_0, priority=65, startTime=0, containerId=null, assignedInstance=null, uniqueId=55, localityDelayTimeout=9223372036854775807} to container container_222212222_2622_01_000056 on node=DynamicServiceInstance [alive=true, host=3n01:15001 with resources=<memory:59392, vCores:16>, shufflePort=15551, servicesAddress=http://3n01:15002, mgmtPort=15004]
2016-11-18 20:20:01,982 [INFO] [LlapScheduler] |tezplugins.LlapTaskSchedulerService|: ScheduleResult for Task: TaskInfo{task=attempt_1478967587833_2622_1_06_000031_0, priority=65, startTime=10550817928, containerId=container_222212222_2622_01_000056, assignedInstance=DynamicServiceInstance [alive=true, host=3n01:15001 with resources=<memory:59392, vCores:16>, shufflePort=15551, servicesAddress=http://3n01:15002, mgmtPort=15004], uniqueId=55, localityDelayTimeout=9223372036854775807} = SCHEDULED
2016-11-18 20:20:03,427 [INFO] [Dispatcher thread {Central}] |impl.TaskAttemptImpl|: TaskAttempt: [attempt_1478967587833_2622_1_06_000031_0] started. Is using containerId: [container_222212222_2622_01_000056] on NM: [3n01:15001]
2016-11-18 20:20:03,427 [INFO] [Dispatcher thread {Central}] |history.HistoryEventHandler|: [HISTORY][DAG:dag_1478967587833_2622_1][Event:TASK_ATTEMPT_STARTED]: vertexName=Map 1, taskAttemptId=attempt_1478967587833_2622_1_06_000031_0, startTime=1479500403427, containerId=container_222212222_2622_01_000056, nodeId=3n01:15001
2016-11-18 20:20:03,430 [INFO] [TaskCommunicator # 1] |tezplugins.LlapTaskCommunicator|: Successfully launched task: attempt_1478967587833_2622_1_06_000031_0
2016-11-18 20:20:03,434 [INFO] [IPC Server handler 11 on 43092] |impl.TaskImpl|: TaskAttempt:attempt_1478967587833_2622_1_06_000031_0 sent events: (0-1).
2016-11-18 20:20:03,434 [INFO] [IPC Server handler 11 on 43092] |impl.VertexImpl|: Sending attempt_1478967587833_2622_1_06_000031_0 24 events [0,24) total 24 vertex_1478967587833_2622_1_06 [Map 1]
2016-11-18 20:25:43,249 [INFO] [Dispatcher thread {Central}] |history.HistoryEventHandler|: [HISTORY][DAG:dag_1478967587833_2622_1][Event:TASK_ATTEMPT_FINISHED]: vertexName=Map 1, taskAttemptId=attempt_1478967587833_2622_1_06_000031_0, creationTime=1479500401929, allocationTime=1479500403426, startTime=1479500403427, finishTime=1479500743249, timeTaken=339822, status=FAILED, taskFailureType=NON_FATAL, errorEnum=TASK_HEARTBEAT_ERROR, diagnostics=AttemptID:attempt_1478967587833_2622_1_06_000031_0 Timed out after 300 secs, nodeHttpAddress=http://3n01:15002, counters=Counters: 1, org.apache.tez.common.counters.DAGCounter, DATA_LOCAL_TASKS=1
2016-11-18 20:25:43,255 [INFO] [TaskSchedulerEventHandlerThread] |tezplugins.LlapTaskSchedulerService|: Processing de-allocate request for task=attempt_1478967587833_2622_1_06_000031_0, state=ASSIGNED, endReason=OTHER
2016-11-18 20:25:43,259 [INFO] [Dispatcher thread {Central}] |node.AMNodeImpl|: Attempt failed on node: 3n01:15001 TA: attempt_1478967587833_2622_1_06_000031_0 failed: true container: container_222212222_2622_01_000056 numFailedTAs: 7
2016-11-18 20:25:43,262 [INFO] [Dispatcher thread {Central}] |impl.VertexImpl|: Source task attempt completed for vertex: vertex_1478967587833_2622_1_07 [Reducer 2] attempt: attempt_1478967587833_2622_1_06_000031_0 with state: FAILED vertexState: RUNNING
{noformat}
Second attempt:
{noformat}
2016-11-18 20:25:43,267 [INFO] [TaskSchedulerEventHandlerThread] |tezplugins.LlapTaskSchedulerService|: Received allocateRequest. task=attempt_1478967587833_2622_1_06_000031_1, priority=64, capability=<memory:4096, vCores:1>, hosts=null
2016-11-18 20:25:43,297 [INFO] [LlapScheduler] |tezplugins.LlapTaskSchedulerService|: ScheduleResult for Task: TaskInfo{task=attempt_1478967587833_2622_1_06_000031_1, priority=64, startTime=0, containerId=null, assignedInstance=null, uniqueId=318, localityDelayTimeout=9223372036854775807} = DELAYED_RESOURCES
2016-11-18 20:25:43,297 [INFO] [LlapScheduler] |tezplugins.LlapTaskSchedulerService|: Preempting for task=attempt_1478967587833_2622_1_06_000031_1 on any available host
2016-11-18 20:25:43,298 [INFO] [LlapScheduler] |tezplugins.LlapTaskSchedulerService|: ScheduleResult for Task: TaskInfo{task=attempt_1478967587833_2622_1_06_000031_1, priority=64, startTime=0, containerId=null, assignedInstance=null, uniqueId=318, localityDelayTimeout=9223372036854775807} = DELAYED_RESOURCES
2016-11-18 20:25:43,299 [INFO] [LlapScheduler] |tezplugins.LlapTaskSchedulerService|: Assigned task TaskInfo{task=attempt_1478967587833_2622_1_06_000031_1, priority=64, startTime=0, containerId=null, assignedInstance=null, uniqueId=318, localityDelayTimeout=9223372036854775807} to container container_222212222_2622_01_000164 on node=DynamicServiceInstance [alive=true, host=1n02:15001 with resources=<memory:59392, vCores:16>, shufflePort=15551, servicesAddress=http://1n02:15002, mgmtPort=15004]
2016-11-18 20:25:43,300 [INFO] [LlapScheduler] |tezplugins.LlapTaskSchedulerService|: ScheduleResult for Task: TaskInfo{task=attempt_1478967587833_2622_1_06_000031_1, priority=64, startTime=10551159245, containerId=container_222212222_2622_01_000164, assignedInstance=DynamicServiceInstance [alive=true, host=1n02:15001 with resources=<memory:59392, vCores:16>, shufflePort=15551, servicesAddress=http://1n02:15002, mgmtPort=15004], uniqueId=318, localityDelayTimeout=9223372036854775807} = SCHEDULED
2016-11-18 20:25:43,301 [INFO] [Dispatcher thread {Central}] |impl.TaskAttemptImpl|: TaskAttempt: [attempt_1478967587833_2622_1_06_000031_1] started. Is using containerId: [container_222212222_2622_01_000164] on NM: [1n02:15001]
2016-11-18 20:25:43,301 [INFO] [Dispatcher thread {Central}] |history.HistoryEventHandler|: [HISTORY][DAG:dag_1478967587833_2622_1][Event:TASK_ATTEMPT_STARTED]: vertexName=Map 1, taskAttemptId=attempt_1478967587833_2622_1_06_000031_1, startTime=1479500743301, containerId=container_222212222_2622_01_000164, nodeId=1n02:15001
2016-11-18 20:25:43,485 [INFO] [TaskCommunicator # 7] |tezplugins.LlapTaskCommunicator|: Successfully launched task: attempt_1478967587833_2622_1_06_000031_1
2016-11-18 20:25:43,500 [INFO] [IPC Server handler 5 on 43092] |impl.TaskImpl|: TaskAttempt:attempt_1478967587833_2622_1_06_000031_1 sent events: (0-1).
2016-11-18 20:25:43,500 [INFO] [IPC Server handler 5 on 43092] |impl.VertexImpl|: Sending attempt_1478967587833_2622_1_06_000031_1 24 events [0,24) total 24 vertex_1478967587833_2622_1_06 [Map 1]
{noformat}
The IPC thread logs other activity 250ms later, so it is not stuck, nor did it log an error. After this, there are no further mentions of any task for 2622_1_06_000031 in the log, and the attempt was not running in LLAP. However, the attempt does come up again when the DAG is killed, a long time after it got stuck; presumably the AM failed to detect the failure (or the heartbeat failure, for that matter) and never tried to reschedule.
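For reference, the mechanism that caught the first attempt (TASK_HEARTBEAT_ERROR, "Timed out after 300 secs") is a per-attempt heartbeat timeout in the AM, and that timeout apparently never fired for the second attempt. A minimal generic sketch of that kind of watchdog, with all names hypothetical (this is an illustration, not the actual Tez code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical illustration only -- not the actual Tez/LLAP implementation.
// Tracks the last heartbeat time per task attempt and reports attempts that
// have exceeded the timeout, analogous to the AM's 300-sec heartbeat check.
class HeartbeatWatchdog {
    private final long timeoutMs;
    private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();

    HeartbeatWatchdog(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    // Called when an attempt registers with or pings the AM.
    void recordHeartbeat(String attemptId, long nowMs) {
        lastHeartbeat.put(attemptId, nowMs);
    }

    // Periodic sweep: returns attempts whose last heartbeat is older than
    // the timeout. An attempt that was never registered here (e.g. dropped
    // during a reschedule) is never reported -- the kind of gap this bug
    // suggests for the second attempt.
    List<String> findTimedOut(long nowMs) {
        List<String> timedOut = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastHeartbeat.entrySet()) {
            if (nowMs - e.getValue() > timeoutMs) {
                timedOut.add(e.getKey());
            }
        }
        return timedOut;
    }
}
```

With a 300,000 ms timeout, an attempt that records a heartbeat at t=0 and then goes silent is reported by a sweep at t=340,000 ms, matching the first attempt's timeline above; an attempt that never registers is never reported, no matter how long it is silent.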
{noformat}
2016-11-22 08:45:14,228 [INFO] [Dispatcher thread {Central}] |history.HistoryEventHandler|: [HISTORY][DAG:dag_1478967587833_2622_1][Event:TASK_ATTEMPT_FINISHED]: vertexName=Map 1, taskAttemptId=attempt_1478967587833_2622_1_06_000031_1, creationTime=1479500743255, allocationTime=1479500743300, startTime=1479500743301, finishTime=1479804314228, timeTaken=303570927, status=KILLED, errorEnum=TERMINATED_BY_CLIENT, diagnostics=Task KILL is received. Killing attempt. Diagnostics: Killing tasks in vertex: vertex_1478967587833_2622_1_06 [Map 1] due to trigger: DAG_TERMINATED, nodeHttpAddress=http://1n02:15002, counters=Counters: 1, org.apache.tez.common.counters.DAGCounter, OTHER_LOCAL_TASKS=1
{noformat}
I do see some connection errors to this node elsewhere (Connection refused), but they happen later. Let me know if you need anything else from the AM log.
{noformat}
2016-11-18 20:27:15,902 [WARN] [TaskCommunicator # 8] |ipc.Client|: Failed to connect to server: 1n02/(snip).2:15001: retries get failed due to exceeded maximum allowed time (in MILLISECONDS): 16000
2016-11-18 20:27:34,517 [WARN] [TaskCommunicator # 3] |ipc.Client|: Failed to connect to server: 1n02/(snip).2:15001: retries get failed due to exceeded maximum allowed time (in MILLISECONDS): 16000
{noformat}
It seems a parallel query caused memory exhaustion on LLAP; most mappers for this job eventually completed successfully, a few were lost as described above, and the DAG got stuck because of that.
Relevant Slider container kills and restarts (the last one appears to be the one where the attempt was running). Unfortunately, the logs from the relevant LLAP containers are missing from aggregation.
{noformat}
2016-11-18 20:20:24,822 [AMRM Callback Handler Thread] ERROR appmaster.SliderAppMaster - Role instance RoleInstance{role='LLAP', id='container_e55_1478967587833_2465_01_000005', container=ContainerID=container_e55_1478967587833_2465_01_000005 nodeID=3n01:45454 ... diagnostics='Container [pid=1636544,containerID=container_e55_1478967587833_2465_01_000005] is running beyond physical memory limits. Current usage: 140.3 GB of 140.3 GB physical memory used; 141.8 GB of 294.5 GB virtual memory used. Killing container.
2016-11-18 20:20:31,912 [AMRM Callback Handler Thread] ERROR appmaster.SliderAppMaster - Role instance RoleInstance{role='LLAP', id='container_e55_1478967587833_2465_01_000003', container=ContainerID=container_e55_1478967587833_2465_01_000003 nodeID=1n02:45454 ... diagnostics='Container [pid=607173,containerID=container_e55_1478967587833_2465_01_000003] is running beyond physical memory limits. Current usage: 141.0 GB of 140.3 GB physical memory used; 142.5 GB of 294.5 GB virtual memory used. Killing container.
2016-11-18 20:25:39,248 [AMRM Callback Handler Thread] ERROR appmaster.SliderAppMaster - Role instance RoleInstance{role='LLAP', id='container_e55_1478967587833_2465_01_000007', container=ContainerID=container_e55_1478967587833_2465_01_000007 nodeID=3n01:45454 ... diagnostics='Container [pid=1227721,containerID=container_e55_1478967587833_2465_01_000007] is running beyond physical memory limits. Current usage: 140.3 GB of 140.3 GB physical memory used; 143.2 GB of 294.5 GB virtual memory used. Killing container.
2016-11-18 20:26:30,250 [AMRM Callback Handler Thread] ERROR appmaster.SliderAppMaster - Role instance RoleInstance{role='LLAP', id='container_e55_1478967587833_2465_01_000009', container=ContainerID=container_e55_1478967587833_2465_01_000009 nodeID=1n02:45454 ... diagnostics='Container [pid=1893586,containerID=container_e55_1478967587833_2465_01_000009] is running beyond physical memory limits. Current usage: 141.2 GB of 140.3 GB physical memory used; 144.4 GB of 294.5 GB virtual memory used. Killing container.
2016-11-18 20:20:41,060 [145589505@qtp-14633842-4] INFO agent.AgentProviderService - Component operation.
Status: COMPLETED; new container state: HEALTHY; new component state: STARTED
2016-11-18 20:20:49,171 [145589505@qtp-14633842-4] INFO agent.AgentProviderService - Component operation. Status: COMPLETED; new container state: HEALTHY; new component state: STARTED
2016-11-18 20:25:55,942 [145589505@qtp-14633842-4] INFO agent.AgentProviderService - Component operation. Status: COMPLETED; new container state: HEALTHY; new component state: STARTED
2016-11-18 20:26:47,019 [145589505@qtp-14633842-4] INFO agent.AgentProviderService - Component operation. Status: COMPLETED; new container state: HEALTHY; new component state: STARTED
{noformat}

> AM may lose task failures and not reschedule when scheduling to LLAP
> --------------------------------------------------------------------
>
>                 Key: HIVE-15296
>                 URL: https://issues.apache.org/jira/browse/HIVE-15296
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>            Assignee: Siddharth Seth

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)