[jira] [Updated] (HIVE-15296) AM may lose task failures and not reschedule when scheduling to LLAP

2016-12-07 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated HIVE-15296:

   Resolution: Fixed
Fix Version/s: 2.2.0
   Status: Resolved  (was: Patch Available)

Committed to master. Thanks for the review!

> AM may lose task failures and not reschedule when scheduling to LLAP
> 
>
> Key: HIVE-15296
> URL: https://issues.apache.org/jira/browse/HIVE-15296
> Project: Hive
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
> Fix For: 2.2.0
>
> Attachments: HIVE-15296.01.patch, HIVE-15296.patch, HIVE-15296.patch
>
>
> First attempt and failure detection:
> {noformat}
> 2016-11-18 20:20:01,980 [INFO] [TaskSchedulerEventHandlerThread] 
> |tezplugins.LlapTaskSchedulerService|: Received allocateRequest. 
> task=attempt_1478967587833_2622_1_06_31_0, priority=65, 
> capability=, hosts=[3n01]
> 2016-11-18 20:20:01,982 [INFO] [LlapScheduler] 
> |tezplugins.LlapTaskSchedulerService|: Assigned task 
> TaskInfo{task=attempt_1478967587833_2622_1_06_31_0, priority=65, 
> startTime=0, containerId=null, assignedInstance=null, uniqueId=55, 
> localityDelayTimeout=9223372036854775807} to container 
> container_1_2622_01_56 on node=DynamicServiceInstance 
> [alive=true, host=3n01:15001 with resources=, 
> shufflePort=15551, servicesAddress=http://3n01:15002, mgmtPort=15004]
> 2016-11-18 20:20:01,982 [INFO] [LlapScheduler] 
> |tezplugins.LlapTaskSchedulerService|: ScheduleResult for Task: 
> TaskInfo{task=attempt_1478967587833_2622_1_06_31_0, priority=65, 
> startTime=10550817928, containerId=container_1_2622_01_56, 
> assignedInstance=DynamicServiceInstance [alive=true, host=3n01:15001 with 
> resources=, shufflePort=15551, 
> servicesAddress=http://3n01:15002, mgmtPort=15004], uniqueId=55, 
> localityDelayTimeout=9223372036854775807} = SCHEDULED
> 2016-11-18 20:20:03,427 [INFO] [Dispatcher thread {Central}] 
> |impl.TaskAttemptImpl|: TaskAttempt: 
> [attempt_1478967587833_2622_1_06_31_0] started. Is using containerId: 
> [container_1_2622_01_56] on NM: [3n01:15001]
> 2016-11-18 20:20:03,427 [INFO] [Dispatcher thread {Central}] 
> |history.HistoryEventHandler|: 
> [HISTORY][DAG:dag_1478967587833_2622_1][Event:TASK_ATTEMPT_STARTED]: 
> vertexName=Map 1, taskAttemptId=attempt_1478967587833_2622_1_06_31_0, 
> startTime=1479500403427, containerId=container_1_2622_01_56, 
> nodeId=3n01:15001
> 2016-11-18 20:20:03,430 [INFO] [TaskCommunicator # 1] 
> |tezplugins.LlapTaskCommunicator|: Successfully launched task: 
> attempt_1478967587833_2622_1_06_31_0
> 2016-11-18 20:20:03,434 [INFO] [IPC Server handler 11 on 43092] 
> |impl.TaskImpl|: TaskAttempt:attempt_1478967587833_2622_1_06_31_0 sent 
> events: (0-1).
> 2016-11-18 20:20:03,434 [INFO] [IPC Server handler 11 on 43092] 
> |impl.VertexImpl|: Sending attempt_1478967587833_2622_1_06_31_0 24 events 
> [0,24) total 24 vertex_1478967587833_2622_1_06 [Map 1]
> 2016-11-18 20:25:43,249 [INFO] [Dispatcher thread {Central}] 
> |history.HistoryEventHandler|: 
> [HISTORY][DAG:dag_1478967587833_2622_1][Event:TASK_ATTEMPT_FINISHED]: 
> vertexName=Map 1, taskAttemptId=attempt_1478967587833_2622_1_06_31_0, 
> creationTime=1479500401929, allocationTime=1479500403426, 
> startTime=1479500403427, finishTime=1479500743249, timeTaken=339822, 
> status=FAILED, taskFailureType=NON_FATAL, errorEnum=TASK_HEARTBEAT_ERROR, 
> diagnostics=AttemptID:attempt_1478967587833_2622_1_06_31_0 Timed out 
> after 300 secs, nodeHttpAddress=http://3n01:15002, counters=Counters: 1, 
> org.apache.tez.common.counters.DAGCounter, DATA_LOCAL_TASKS=1
> 2016-11-18 20:25:43,255 [INFO] [TaskSchedulerEventHandlerThread] 
> |tezplugins.LlapTaskSchedulerService|: Processing de-allocate request for 
> task=attempt_1478967587833_2622_1_06_31_0, state=ASSIGNED, endReason=OTHER
> 2016-11-18 20:25:43,259 [INFO] [Dispatcher thread {Central}] 
> |node.AMNodeImpl|: Attempt failed on node: 3n01:15001 TA: 
> attempt_1478967587833_2622_1_06_31_0 failed: true container: 
> container_1_2622_01_56 numFailedTAs: 7
> 2016-11-18 20:25:43,262 [INFO] [Dispatcher thread {Central}] 
> |impl.VertexImpl|: Source task attempt completed for vertex: 
> vertex_1478967587833_2622_1_07 [Reducer 2] attempt: 
> attempt_1478967587833_2622_1_06_31_0 with state: FAILED vertexState: 
> RUNNING
> {noformat}
> Second attempt:
> {noformat}
> 2016-11-18 20:25:43,267 [INFO] [TaskSchedulerEventHandlerThread] 
> |tezplugins.LlapTaskSchedulerService|: Received allocateRequest. 
> task=attempt_1478967587833_2622_1_06_31_1, priority=64, 
> capability=

[jira] [Updated] (HIVE-15296) AM may lose task failures and not reschedule when scheduling to LLAP

2016-12-06 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated HIVE-15296:

Attachment: HIVE-15296.01.patch

Minor update

> AM may lose task failures and not reschedule when scheduling to LLAP
> 
>
> Key: HIVE-15296
> URL: https://issues.apache.org/jira/browse/HIVE-15296
> Project: Hive
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
> Attachments: HIVE-15296.01.patch, HIVE-15296.patch, HIVE-15296.patch
>
>
> First attempt and failure detection:
> {noformat}
> 2016-11-18 20:20:01,980 [INFO] [TaskSchedulerEventHandlerThread] 
> |tezplugins.LlapTaskSchedulerService|: Received allocateRequest. 
> task=attempt_1478967587833_2622_1_06_31_0, priority=65, 
> capability=, hosts=[3n01]
> 2016-11-18 20:20:01,982 [INFO] [LlapScheduler] 
> |tezplugins.LlapTaskSchedulerService|: Assigned task 
> TaskInfo{task=attempt_1478967587833_2622_1_06_31_0, priority=65, 
> startTime=0, containerId=null, assignedInstance=null, uniqueId=55, 
> localityDelayTimeout=9223372036854775807} to container 
> container_1_2622_01_56 on node=DynamicServiceInstance 
> [alive=true, host=3n01:15001 with resources=, 
> shufflePort=15551, servicesAddress=http://3n01:15002, mgmtPort=15004]
> 2016-11-18 20:20:01,982 [INFO] [LlapScheduler] 
> |tezplugins.LlapTaskSchedulerService|: ScheduleResult for Task: 
> TaskInfo{task=attempt_1478967587833_2622_1_06_31_0, priority=65, 
> startTime=10550817928, containerId=container_1_2622_01_56, 
> assignedInstance=DynamicServiceInstance [alive=true, host=3n01:15001 with 
> resources=, shufflePort=15551, 
> servicesAddress=http://3n01:15002, mgmtPort=15004], uniqueId=55, 
> localityDelayTimeout=9223372036854775807} = SCHEDULED
> 2016-11-18 20:20:03,427 [INFO] [Dispatcher thread {Central}] 
> |impl.TaskAttemptImpl|: TaskAttempt: 
> [attempt_1478967587833_2622_1_06_31_0] started. Is using containerId: 
> [container_1_2622_01_56] on NM: [3n01:15001]
> 2016-11-18 20:20:03,427 [INFO] [Dispatcher thread {Central}] 
> |history.HistoryEventHandler|: 
> [HISTORY][DAG:dag_1478967587833_2622_1][Event:TASK_ATTEMPT_STARTED]: 
> vertexName=Map 1, taskAttemptId=attempt_1478967587833_2622_1_06_31_0, 
> startTime=1479500403427, containerId=container_1_2622_01_56, 
> nodeId=3n01:15001
> 2016-11-18 20:20:03,430 [INFO] [TaskCommunicator # 1] 
> |tezplugins.LlapTaskCommunicator|: Successfully launched task: 
> attempt_1478967587833_2622_1_06_31_0
> 2016-11-18 20:20:03,434 [INFO] [IPC Server handler 11 on 43092] 
> |impl.TaskImpl|: TaskAttempt:attempt_1478967587833_2622_1_06_31_0 sent 
> events: (0-1).
> 2016-11-18 20:20:03,434 [INFO] [IPC Server handler 11 on 43092] 
> |impl.VertexImpl|: Sending attempt_1478967587833_2622_1_06_31_0 24 events 
> [0,24) total 24 vertex_1478967587833_2622_1_06 [Map 1]
> 2016-11-18 20:25:43,249 [INFO] [Dispatcher thread {Central}] 
> |history.HistoryEventHandler|: 
> [HISTORY][DAG:dag_1478967587833_2622_1][Event:TASK_ATTEMPT_FINISHED]: 
> vertexName=Map 1, taskAttemptId=attempt_1478967587833_2622_1_06_31_0, 
> creationTime=1479500401929, allocationTime=1479500403426, 
> startTime=1479500403427, finishTime=1479500743249, timeTaken=339822, 
> status=FAILED, taskFailureType=NON_FATAL, errorEnum=TASK_HEARTBEAT_ERROR, 
> diagnostics=AttemptID:attempt_1478967587833_2622_1_06_31_0 Timed out 
> after 300 secs, nodeHttpAddress=http://3n01:15002, counters=Counters: 1, 
> org.apache.tez.common.counters.DAGCounter, DATA_LOCAL_TASKS=1
> 2016-11-18 20:25:43,255 [INFO] [TaskSchedulerEventHandlerThread] 
> |tezplugins.LlapTaskSchedulerService|: Processing de-allocate request for 
> task=attempt_1478967587833_2622_1_06_31_0, state=ASSIGNED, endReason=OTHER
> 2016-11-18 20:25:43,259 [INFO] [Dispatcher thread {Central}] 
> |node.AMNodeImpl|: Attempt failed on node: 3n01:15001 TA: 
> attempt_1478967587833_2622_1_06_31_0 failed: true container: 
> container_1_2622_01_56 numFailedTAs: 7
> 2016-11-18 20:25:43,262 [INFO] [Dispatcher thread {Central}] 
> |impl.VertexImpl|: Source task attempt completed for vertex: 
> vertex_1478967587833_2622_1_07 [Reducer 2] attempt: 
> attempt_1478967587833_2622_1_06_31_0 with state: FAILED vertexState: 
> RUNNING
> {noformat}
> Second attempt:
> {noformat}
> 2016-11-18 20:25:43,267 [INFO] [TaskSchedulerEventHandlerThread] 
> |tezplugins.LlapTaskSchedulerService|: Received allocateRequest. 
> task=attempt_1478967587833_2622_1_06_31_1, priority=64, 
> capability=, hosts=null
> 2016-11-18 20:25:43,297 [INFO] [LlapScheduler] 
> |tezplugins.LlapTaskSchedulerService|: ScheduleResult for

[jira] [Updated] (HIVE-15296) AM may lose task failures and not reschedule when scheduling to LLAP

2016-12-02 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated HIVE-15296:

Attachment: HIVE-15296.patch

> AM may lose task failures and not reschedule when scheduling to LLAP
> 
>
> Key: HIVE-15296
> URL: https://issues.apache.org/jira/browse/HIVE-15296
> Project: Hive
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
> Attachments: HIVE-15296.patch, HIVE-15296.patch
>
>
> First attempt and failure detection:
> {noformat}
> 2016-11-18 20:20:01,980 [INFO] [TaskSchedulerEventHandlerThread] 
> |tezplugins.LlapTaskSchedulerService|: Received allocateRequest. 
> task=attempt_1478967587833_2622_1_06_31_0, priority=65, 
> capability=, hosts=[3n01]
> 2016-11-18 20:20:01,982 [INFO] [LlapScheduler] 
> |tezplugins.LlapTaskSchedulerService|: Assigned task 
> TaskInfo{task=attempt_1478967587833_2622_1_06_31_0, priority=65, 
> startTime=0, containerId=null, assignedInstance=null, uniqueId=55, 
> localityDelayTimeout=9223372036854775807} to container 
> container_1_2622_01_56 on node=DynamicServiceInstance 
> [alive=true, host=3n01:15001 with resources=, 
> shufflePort=15551, servicesAddress=http://3n01:15002, mgmtPort=15004]
> 2016-11-18 20:20:01,982 [INFO] [LlapScheduler] 
> |tezplugins.LlapTaskSchedulerService|: ScheduleResult for Task: 
> TaskInfo{task=attempt_1478967587833_2622_1_06_31_0, priority=65, 
> startTime=10550817928, containerId=container_1_2622_01_56, 
> assignedInstance=DynamicServiceInstance [alive=true, host=3n01:15001 with 
> resources=, shufflePort=15551, 
> servicesAddress=http://3n01:15002, mgmtPort=15004], uniqueId=55, 
> localityDelayTimeout=9223372036854775807} = SCHEDULED
> 2016-11-18 20:20:03,427 [INFO] [Dispatcher thread {Central}] 
> |impl.TaskAttemptImpl|: TaskAttempt: 
> [attempt_1478967587833_2622_1_06_31_0] started. Is using containerId: 
> [container_1_2622_01_56] on NM: [3n01:15001]
> 2016-11-18 20:20:03,427 [INFO] [Dispatcher thread {Central}] 
> |history.HistoryEventHandler|: 
> [HISTORY][DAG:dag_1478967587833_2622_1][Event:TASK_ATTEMPT_STARTED]: 
> vertexName=Map 1, taskAttemptId=attempt_1478967587833_2622_1_06_31_0, 
> startTime=1479500403427, containerId=container_1_2622_01_56, 
> nodeId=3n01:15001
> 2016-11-18 20:20:03,430 [INFO] [TaskCommunicator # 1] 
> |tezplugins.LlapTaskCommunicator|: Successfully launched task: 
> attempt_1478967587833_2622_1_06_31_0
> 2016-11-18 20:20:03,434 [INFO] [IPC Server handler 11 on 43092] 
> |impl.TaskImpl|: TaskAttempt:attempt_1478967587833_2622_1_06_31_0 sent 
> events: (0-1).
> 2016-11-18 20:20:03,434 [INFO] [IPC Server handler 11 on 43092] 
> |impl.VertexImpl|: Sending attempt_1478967587833_2622_1_06_31_0 24 events 
> [0,24) total 24 vertex_1478967587833_2622_1_06 [Map 1]
> 2016-11-18 20:25:43,249 [INFO] [Dispatcher thread {Central}] 
> |history.HistoryEventHandler|: 
> [HISTORY][DAG:dag_1478967587833_2622_1][Event:TASK_ATTEMPT_FINISHED]: 
> vertexName=Map 1, taskAttemptId=attempt_1478967587833_2622_1_06_31_0, 
> creationTime=1479500401929, allocationTime=1479500403426, 
> startTime=1479500403427, finishTime=1479500743249, timeTaken=339822, 
> status=FAILED, taskFailureType=NON_FATAL, errorEnum=TASK_HEARTBEAT_ERROR, 
> diagnostics=AttemptID:attempt_1478967587833_2622_1_06_31_0 Timed out 
> after 300 secs, nodeHttpAddress=http://3n01:15002, counters=Counters: 1, 
> org.apache.tez.common.counters.DAGCounter, DATA_LOCAL_TASKS=1
> 2016-11-18 20:25:43,255 [INFO] [TaskSchedulerEventHandlerThread] 
> |tezplugins.LlapTaskSchedulerService|: Processing de-allocate request for 
> task=attempt_1478967587833_2622_1_06_31_0, state=ASSIGNED, endReason=OTHER
> 2016-11-18 20:25:43,259 [INFO] [Dispatcher thread {Central}] 
> |node.AMNodeImpl|: Attempt failed on node: 3n01:15001 TA: 
> attempt_1478967587833_2622_1_06_31_0 failed: true container: 
> container_1_2622_01_56 numFailedTAs: 7
> 2016-11-18 20:25:43,262 [INFO] [Dispatcher thread {Central}] 
> |impl.VertexImpl|: Source task attempt completed for vertex: 
> vertex_1478967587833_2622_1_07 [Reducer 2] attempt: 
> attempt_1478967587833_2622_1_06_31_0 with state: FAILED vertexState: 
> RUNNING
> {noformat}
> Second attempt:
> {noformat}
> 2016-11-18 20:25:43,267 [INFO] [TaskSchedulerEventHandlerThread] 
> |tezplugins.LlapTaskSchedulerService|: Received allocateRequest. 
> task=attempt_1478967587833_2622_1_06_31_1, priority=64, 
> capability=, hosts=null
> 2016-11-18 20:25:43,297 [INFO] [LlapScheduler] 
> |tezplugins.LlapTaskSchedulerService|: ScheduleResult for Task: 
> TaskInfo{task=attempt_147896

[jira] [Updated] (HIVE-15296) AM may lose task failures and not reschedule when scheduling to LLAP

2016-11-30 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated HIVE-15296:

Status: Patch Available  (was: Open)

> AM may lose task failures and not reschedule when scheduling to LLAP
> 
>
> Key: HIVE-15296
> URL: https://issues.apache.org/jira/browse/HIVE-15296
> Project: Hive
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
> Attachments: HIVE-15296.patch
>
>
> First attempt and failure detection:
> {noformat}
> 2016-11-18 20:20:01,980 [INFO] [TaskSchedulerEventHandlerThread] 
> |tezplugins.LlapTaskSchedulerService|: Received allocateRequest. 
> task=attempt_1478967587833_2622_1_06_31_0, priority=65, 
> capability=, hosts=[3n01]
> 2016-11-18 20:20:01,982 [INFO] [LlapScheduler] 
> |tezplugins.LlapTaskSchedulerService|: Assigned task 
> TaskInfo{task=attempt_1478967587833_2622_1_06_31_0, priority=65, 
> startTime=0, containerId=null, assignedInstance=null, uniqueId=55, 
> localityDelayTimeout=9223372036854775807} to container 
> container_1_2622_01_56 on node=DynamicServiceInstance 
> [alive=true, host=3n01:15001 with resources=, 
> shufflePort=15551, servicesAddress=http://3n01:15002, mgmtPort=15004]
> 2016-11-18 20:20:01,982 [INFO] [LlapScheduler] 
> |tezplugins.LlapTaskSchedulerService|: ScheduleResult for Task: 
> TaskInfo{task=attempt_1478967587833_2622_1_06_31_0, priority=65, 
> startTime=10550817928, containerId=container_1_2622_01_56, 
> assignedInstance=DynamicServiceInstance [alive=true, host=3n01:15001 with 
> resources=, shufflePort=15551, 
> servicesAddress=http://3n01:15002, mgmtPort=15004], uniqueId=55, 
> localityDelayTimeout=9223372036854775807} = SCHEDULED
> 2016-11-18 20:20:03,427 [INFO] [Dispatcher thread {Central}] 
> |impl.TaskAttemptImpl|: TaskAttempt: 
> [attempt_1478967587833_2622_1_06_31_0] started. Is using containerId: 
> [container_1_2622_01_56] on NM: [3n01:15001]
> 2016-11-18 20:20:03,427 [INFO] [Dispatcher thread {Central}] 
> |history.HistoryEventHandler|: 
> [HISTORY][DAG:dag_1478967587833_2622_1][Event:TASK_ATTEMPT_STARTED]: 
> vertexName=Map 1, taskAttemptId=attempt_1478967587833_2622_1_06_31_0, 
> startTime=1479500403427, containerId=container_1_2622_01_56, 
> nodeId=3n01:15001
> 2016-11-18 20:20:03,430 [INFO] [TaskCommunicator # 1] 
> |tezplugins.LlapTaskCommunicator|: Successfully launched task: 
> attempt_1478967587833_2622_1_06_31_0
> 2016-11-18 20:20:03,434 [INFO] [IPC Server handler 11 on 43092] 
> |impl.TaskImpl|: TaskAttempt:attempt_1478967587833_2622_1_06_31_0 sent 
> events: (0-1).
> 2016-11-18 20:20:03,434 [INFO] [IPC Server handler 11 on 43092] 
> |impl.VertexImpl|: Sending attempt_1478967587833_2622_1_06_31_0 24 events 
> [0,24) total 24 vertex_1478967587833_2622_1_06 [Map 1]
> 2016-11-18 20:25:43,249 [INFO] [Dispatcher thread {Central}] 
> |history.HistoryEventHandler|: 
> [HISTORY][DAG:dag_1478967587833_2622_1][Event:TASK_ATTEMPT_FINISHED]: 
> vertexName=Map 1, taskAttemptId=attempt_1478967587833_2622_1_06_31_0, 
> creationTime=1479500401929, allocationTime=1479500403426, 
> startTime=1479500403427, finishTime=1479500743249, timeTaken=339822, 
> status=FAILED, taskFailureType=NON_FATAL, errorEnum=TASK_HEARTBEAT_ERROR, 
> diagnostics=AttemptID:attempt_1478967587833_2622_1_06_31_0 Timed out 
> after 300 secs, nodeHttpAddress=http://3n01:15002, counters=Counters: 1, 
> org.apache.tez.common.counters.DAGCounter, DATA_LOCAL_TASKS=1
> 2016-11-18 20:25:43,255 [INFO] [TaskSchedulerEventHandlerThread] 
> |tezplugins.LlapTaskSchedulerService|: Processing de-allocate request for 
> task=attempt_1478967587833_2622_1_06_31_0, state=ASSIGNED, endReason=OTHER
> 2016-11-18 20:25:43,259 [INFO] [Dispatcher thread {Central}] 
> |node.AMNodeImpl|: Attempt failed on node: 3n01:15001 TA: 
> attempt_1478967587833_2622_1_06_31_0 failed: true container: 
> container_1_2622_01_56 numFailedTAs: 7
> 2016-11-18 20:25:43,262 [INFO] [Dispatcher thread {Central}] 
> |impl.VertexImpl|: Source task attempt completed for vertex: 
> vertex_1478967587833_2622_1_07 [Reducer 2] attempt: 
> attempt_1478967587833_2622_1_06_31_0 with state: FAILED vertexState: 
> RUNNING
> {noformat}
> Second attempt:
> {noformat}
> 2016-11-18 20:25:43,267 [INFO] [TaskSchedulerEventHandlerThread] 
> |tezplugins.LlapTaskSchedulerService|: Received allocateRequest. 
> task=attempt_1478967587833_2622_1_06_31_1, priority=64, 
> capability=, hosts=null
> 2016-11-18 20:25:43,297 [INFO] [LlapScheduler] 
> |tezplugins.LlapTaskSchedulerService|: ScheduleResult for Task: 
> TaskInfo{task=attempt_1478967587833_26

[jira] [Updated] (HIVE-15296) AM may lose task failures and not reschedule when scheduling to LLAP

2016-11-30 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated HIVE-15296:

Attachment: HIVE-15296.patch

Not 100% sure of the root cause.
1) The difference in heartbeat mechanisms before and after the task is running 
is actually irrelevant, since it is only present on LLAP side.
2) The tracking and killing itself is done by Tez AM. Unless there's a bug in 
Tez, which I cannot see in the code, it should work.

The problem must be in the part that connects these together. Upon examination, 
I wonder if having heartbeats from the node (host+port) blindly heartbeat all 
the tasks that AM *thinks* are on the node is the culrpit? The restarted LLAP 
can heartbeat in time. 
This patch propagates unique daemonId usage to node heartbeats (and submit task 
response, to know the original unique node ID for the task in the first place).
I left some TODO comments to improve the handling for this, but for now I'd 
rather not rewrite EntityTracker entirely to use unique ID everywhere.

[~sseth] [~prasanth_j] can you take a look?

> AM may lose task failures and not reschedule when scheduling to LLAP
> 
>
> Key: HIVE-15296
> URL: https://issues.apache.org/jira/browse/HIVE-15296
> Project: Hive
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
> Attachments: HIVE-15296.patch
>
>
> First attempt and failure detection:
> {noformat}
> 2016-11-18 20:20:01,980 [INFO] [TaskSchedulerEventHandlerThread] 
> |tezplugins.LlapTaskSchedulerService|: Received allocateRequest. 
> task=attempt_1478967587833_2622_1_06_31_0, priority=65, 
> capability=, hosts=[3n01]
> 2016-11-18 20:20:01,982 [INFO] [LlapScheduler] 
> |tezplugins.LlapTaskSchedulerService|: Assigned task 
> TaskInfo{task=attempt_1478967587833_2622_1_06_31_0, priority=65, 
> startTime=0, containerId=null, assignedInstance=null, uniqueId=55, 
> localityDelayTimeout=9223372036854775807} to container 
> container_1_2622_01_56 on node=DynamicServiceInstance 
> [alive=true, host=3n01:15001 with resources=, 
> shufflePort=15551, servicesAddress=http://3n01:15002, mgmtPort=15004]
> 2016-11-18 20:20:01,982 [INFO] [LlapScheduler] 
> |tezplugins.LlapTaskSchedulerService|: ScheduleResult for Task: 
> TaskInfo{task=attempt_1478967587833_2622_1_06_31_0, priority=65, 
> startTime=10550817928, containerId=container_1_2622_01_56, 
> assignedInstance=DynamicServiceInstance [alive=true, host=3n01:15001 with 
> resources=, shufflePort=15551, 
> servicesAddress=http://3n01:15002, mgmtPort=15004], uniqueId=55, 
> localityDelayTimeout=9223372036854775807} = SCHEDULED
> 2016-11-18 20:20:03,427 [INFO] [Dispatcher thread {Central}] 
> |impl.TaskAttemptImpl|: TaskAttempt: 
> [attempt_1478967587833_2622_1_06_31_0] started. Is using containerId: 
> [container_1_2622_01_56] on NM: [3n01:15001]
> 2016-11-18 20:20:03,427 [INFO] [Dispatcher thread {Central}] 
> |history.HistoryEventHandler|: 
> [HISTORY][DAG:dag_1478967587833_2622_1][Event:TASK_ATTEMPT_STARTED]: 
> vertexName=Map 1, taskAttemptId=attempt_1478967587833_2622_1_06_31_0, 
> startTime=1479500403427, containerId=container_1_2622_01_56, 
> nodeId=3n01:15001
> 2016-11-18 20:20:03,430 [INFO] [TaskCommunicator # 1] 
> |tezplugins.LlapTaskCommunicator|: Successfully launched task: 
> attempt_1478967587833_2622_1_06_31_0
> 2016-11-18 20:20:03,434 [INFO] [IPC Server handler 11 on 43092] 
> |impl.TaskImpl|: TaskAttempt:attempt_1478967587833_2622_1_06_31_0 sent 
> events: (0-1).
> 2016-11-18 20:20:03,434 [INFO] [IPC Server handler 11 on 43092] 
> |impl.VertexImpl|: Sending attempt_1478967587833_2622_1_06_31_0 24 events 
> [0,24) total 24 vertex_1478967587833_2622_1_06 [Map 1]
> 2016-11-18 20:25:43,249 [INFO] [Dispatcher thread {Central}] 
> |history.HistoryEventHandler|: 
> [HISTORY][DAG:dag_1478967587833_2622_1][Event:TASK_ATTEMPT_FINISHED]: 
> vertexName=Map 1, taskAttemptId=attempt_1478967587833_2622_1_06_31_0, 
> creationTime=1479500401929, allocationTime=1479500403426, 
> startTime=1479500403427, finishTime=1479500743249, timeTaken=339822, 
> status=FAILED, taskFailureType=NON_FATAL, errorEnum=TASK_HEARTBEAT_ERROR, 
> diagnostics=AttemptID:attempt_1478967587833_2622_1_06_31_0 Timed out 
> after 300 secs, nodeHttpAddress=http://3n01:15002, counters=Counters: 1, 
> org.apache.tez.common.counters.DAGCounter, DATA_LOCAL_TASKS=1
> 2016-11-18 20:25:43,255 [INFO] [TaskSchedulerEventHandlerThread] 
> |tezplugins.LlapTaskSchedulerService|: Processing de-allocate request for 
> task=attempt_1478967587833_2622_1_06_31_0, state=ASSIGNED, endReason=OTHER
> 2016-11-18 20:25:43,259 [INFO] [Dispatcher 

[jira] [Updated] (HIVE-15296) AM may lose task failures and not reschedule when scheduling to LLAP

2016-11-28 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated HIVE-15296:

Attachment: (was: HIVE-15279.02.patch)

> AM may lose task failures and not reschedule when scheduling to LLAP
> 
>
> Key: HIVE-15296
> URL: https://issues.apache.org/jira/browse/HIVE-15296
> Project: Hive
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Siddharth Seth
>
> First attempt and failure detection:
> {noformat}
> 2016-11-18 20:20:01,980 [INFO] [TaskSchedulerEventHandlerThread] 
> |tezplugins.LlapTaskSchedulerService|: Received allocateRequest. 
> task=attempt_1478967587833_2622_1_06_31_0, priority=65, 
> capability=, hosts=[3n01]
> 2016-11-18 20:20:01,982 [INFO] [LlapScheduler] 
> |tezplugins.LlapTaskSchedulerService|: Assigned task 
> TaskInfo{task=attempt_1478967587833_2622_1_06_31_0, priority=65, 
> startTime=0, containerId=null, assignedInstance=null, uniqueId=55, 
> localityDelayTimeout=9223372036854775807} to container 
> container_1_2622_01_56 on node=DynamicServiceInstance 
> [alive=true, host=3n01:15001 with resources=, 
> shufflePort=15551, servicesAddress=http://3n01:15002, mgmtPort=15004]
> 2016-11-18 20:20:01,982 [INFO] [LlapScheduler] 
> |tezplugins.LlapTaskSchedulerService|: ScheduleResult for Task: 
> TaskInfo{task=attempt_1478967587833_2622_1_06_31_0, priority=65, 
> startTime=10550817928, containerId=container_1_2622_01_56, 
> assignedInstance=DynamicServiceInstance [alive=true, host=3n01:15001 with 
> resources=, shufflePort=15551, 
> servicesAddress=http://3n01:15002, mgmtPort=15004], uniqueId=55, 
> localityDelayTimeout=9223372036854775807} = SCHEDULED
> 2016-11-18 20:20:03,427 [INFO] [Dispatcher thread {Central}] 
> |impl.TaskAttemptImpl|: TaskAttempt: 
> [attempt_1478967587833_2622_1_06_31_0] started. Is using containerId: 
> [container_1_2622_01_56] on NM: [3n01:15001]
> 2016-11-18 20:20:03,427 [INFO] [Dispatcher thread {Central}] 
> |history.HistoryEventHandler|: 
> [HISTORY][DAG:dag_1478967587833_2622_1][Event:TASK_ATTEMPT_STARTED]: 
> vertexName=Map 1, taskAttemptId=attempt_1478967587833_2622_1_06_31_0, 
> startTime=1479500403427, containerId=container_1_2622_01_56, 
> nodeId=3n01:15001
> 2016-11-18 20:20:03,430 [INFO] [TaskCommunicator # 1] 
> |tezplugins.LlapTaskCommunicator|: Successfully launched task: 
> attempt_1478967587833_2622_1_06_31_0
> 2016-11-18 20:20:03,434 [INFO] [IPC Server handler 11 on 43092] 
> |impl.TaskImpl|: TaskAttempt:attempt_1478967587833_2622_1_06_31_0 sent 
> events: (0-1).
> 2016-11-18 20:20:03,434 [INFO] [IPC Server handler 11 on 43092] 
> |impl.VertexImpl|: Sending attempt_1478967587833_2622_1_06_31_0 24 events 
> [0,24) total 24 vertex_1478967587833_2622_1_06 [Map 1]
> 2016-11-18 20:25:43,249 [INFO] [Dispatcher thread {Central}] 
> |history.HistoryEventHandler|: 
> [HISTORY][DAG:dag_1478967587833_2622_1][Event:TASK_ATTEMPT_FINISHED]: 
> vertexName=Map 1, taskAttemptId=attempt_1478967587833_2622_1_06_31_0, 
> creationTime=1479500401929, allocationTime=1479500403426, 
> startTime=1479500403427, finishTime=1479500743249, timeTaken=339822, 
> status=FAILED, taskFailureType=NON_FATAL, errorEnum=TASK_HEARTBEAT_ERROR, 
> diagnostics=AttemptID:attempt_1478967587833_2622_1_06_31_0 Timed out 
> after 300 secs, nodeHttpAddress=http://3n01:15002, counters=Counters: 1, 
> org.apache.tez.common.counters.DAGCounter, DATA_LOCAL_TASKS=1
> 2016-11-18 20:25:43,255 [INFO] [TaskSchedulerEventHandlerThread] 
> |tezplugins.LlapTaskSchedulerService|: Processing de-allocate request for 
> task=attempt_1478967587833_2622_1_06_31_0, state=ASSIGNED, endReason=OTHER
> 2016-11-18 20:25:43,259 [INFO] [Dispatcher thread {Central}] 
> |node.AMNodeImpl|: Attempt failed on node: 3n01:15001 TA: 
> attempt_1478967587833_2622_1_06_31_0 failed: true container: 
> container_1_2622_01_56 numFailedTAs: 7
> 2016-11-18 20:25:43,262 [INFO] [Dispatcher thread {Central}] 
> |impl.VertexImpl|: Source task attempt completed for vertex: 
> vertex_1478967587833_2622_1_07 [Reducer 2] attempt: 
> attempt_1478967587833_2622_1_06_31_0 with state: FAILED vertexState: 
> RUNNING
> {noformat}
> Second attempt:
> {noformat}
> 2016-11-18 20:25:43,267 [INFO] [TaskSchedulerEventHandlerThread] 
> |tezplugins.LlapTaskSchedulerService|: Received allocateRequest. 
> task=attempt_1478967587833_2622_1_06_31_1, priority=64, 
> capability=, hosts=null
> 2016-11-18 20:25:43,297 [INFO] [LlapScheduler] 
> |tezplugins.LlapTaskSchedulerService|: ScheduleResult for Task: 
> TaskInfo{task=attempt_1478967587833_2622_1_06_31_1, priority=64, 
> star

[jira] [Updated] (HIVE-15296) AM may lose task failures and not reschedule when scheduling to LLAP

2016-11-28 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated HIVE-15296:

Attachment: HIVE-15279.02.patch

fixing the NPE in the latest patch

> AM may lose task failures and not reschedule when scheduling to LLAP
> 
>
> Key: HIVE-15296
> URL: https://issues.apache.org/jira/browse/HIVE-15296
> Project: Hive
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Siddharth Seth
> Attachments: HIVE-15279.02.patch
>
>
> First attempt and failure detection:
> {noformat}
> 2016-11-18 20:20:01,980 [INFO] [TaskSchedulerEventHandlerThread] 
> |tezplugins.LlapTaskSchedulerService|: Received allocateRequest. 
> task=attempt_1478967587833_2622_1_06_31_0, priority=65, 
> capability=, hosts=[3n01]
> 2016-11-18 20:20:01,982 [INFO] [LlapScheduler] 
> |tezplugins.LlapTaskSchedulerService|: Assigned task 
> TaskInfo{task=attempt_1478967587833_2622_1_06_31_0, priority=65, 
> startTime=0, containerId=null, assignedInstance=null, uniqueId=55, 
> localityDelayTimeout=9223372036854775807} to container 
> container_1_2622_01_56 on node=DynamicServiceInstance 
> [alive=true, host=3n01:15001 with resources=, 
> shufflePort=15551, servicesAddress=http://3n01:15002, mgmtPort=15004]
> 2016-11-18 20:20:01,982 [INFO] [LlapScheduler] 
> |tezplugins.LlapTaskSchedulerService|: ScheduleResult for Task: 
> TaskInfo{task=attempt_1478967587833_2622_1_06_31_0, priority=65, 
> startTime=10550817928, containerId=container_1_2622_01_56, 
> assignedInstance=DynamicServiceInstance [alive=true, host=3n01:15001 with 
> resources=, shufflePort=15551, 
> servicesAddress=http://3n01:15002, mgmtPort=15004], uniqueId=55, 
> localityDelayTimeout=9223372036854775807} = SCHEDULED
> 2016-11-18 20:20:03,427 [INFO] [Dispatcher thread {Central}] 
> |impl.TaskAttemptImpl|: TaskAttempt: 
> [attempt_1478967587833_2622_1_06_31_0] started. Is using containerId: 
> [container_1_2622_01_56] on NM: [3n01:15001]
> 2016-11-18 20:20:03,427 [INFO] [Dispatcher thread {Central}] 
> |history.HistoryEventHandler|: 
> [HISTORY][DAG:dag_1478967587833_2622_1][Event:TASK_ATTEMPT_STARTED]: 
> vertexName=Map 1, taskAttemptId=attempt_1478967587833_2622_1_06_31_0, 
> startTime=1479500403427, containerId=container_1_2622_01_56, 
> nodeId=3n01:15001
> 2016-11-18 20:20:03,430 [INFO] [TaskCommunicator # 1] 
> |tezplugins.LlapTaskCommunicator|: Successfully launched task: 
> attempt_1478967587833_2622_1_06_31_0
> 2016-11-18 20:20:03,434 [INFO] [IPC Server handler 11 on 43092] 
> |impl.TaskImpl|: TaskAttempt:attempt_1478967587833_2622_1_06_31_0 sent 
> events: (0-1).
> 2016-11-18 20:20:03,434 [INFO] [IPC Server handler 11 on 43092] 
> |impl.VertexImpl|: Sending attempt_1478967587833_2622_1_06_31_0 24 events 
> [0,24) total 24 vertex_1478967587833_2622_1_06 [Map 1]
> 2016-11-18 20:25:43,249 [INFO] [Dispatcher thread {Central}] 
> |history.HistoryEventHandler|: 
> [HISTORY][DAG:dag_1478967587833_2622_1][Event:TASK_ATTEMPT_FINISHED]: 
> vertexName=Map 1, taskAttemptId=attempt_1478967587833_2622_1_06_31_0, 
> creationTime=1479500401929, allocationTime=1479500403426, 
> startTime=1479500403427, finishTime=1479500743249, timeTaken=339822, 
> status=FAILED, taskFailureType=NON_FATAL, errorEnum=TASK_HEARTBEAT_ERROR, 
> diagnostics=AttemptID:attempt_1478967587833_2622_1_06_31_0 Timed out 
> after 300 secs, nodeHttpAddress=http://3n01:15002, counters=Counters: 1, 
> org.apache.tez.common.counters.DAGCounter, DATA_LOCAL_TASKS=1
> 2016-11-18 20:25:43,255 [INFO] [TaskSchedulerEventHandlerThread] 
> |tezplugins.LlapTaskSchedulerService|: Processing de-allocate request for 
> task=attempt_1478967587833_2622_1_06_31_0, state=ASSIGNED, endReason=OTHER
> 2016-11-18 20:25:43,259 [INFO] [Dispatcher thread {Central}] 
> |node.AMNodeImpl|: Attempt failed on node: 3n01:15001 TA: 
> attempt_1478967587833_2622_1_06_31_0 failed: true container: 
> container_1_2622_01_56 numFailedTAs: 7
> 2016-11-18 20:25:43,262 [INFO] [Dispatcher thread {Central}] 
> |impl.VertexImpl|: Source task attempt completed for vertex: 
> vertex_1478967587833_2622_1_07 [Reducer 2] attempt: 
> attempt_1478967587833_2622_1_06_31_0 with state: FAILED vertexState: 
> RUNNING
> {noformat}
> Second attempt:
> {noformat}
> 2016-11-18 20:25:43,267 [INFO] [TaskSchedulerEventHandlerThread] 
> |tezplugins.LlapTaskSchedulerService|: Received allocateRequest. 
> task=attempt_1478967587833_2622_1_06_31_1, priority=64, 
> capability=, hosts=null
> 2016-11-18 20:25:43,297 [INFO] [LlapScheduler] 
> |tezplugins.LlapTaskSchedulerService|: ScheduleResult for Task: 
> TaskIn

[jira] [Updated] (HIVE-15296) AM may lose task failures and not reschedule when scheduling to LLAP

2016-11-28 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated HIVE-15296:

Description: 
First attempt and failure detection:
{noformat}
2016-11-18 20:20:01,980 [INFO] [TaskSchedulerEventHandlerThread] 
|tezplugins.LlapTaskSchedulerService|: Received allocateRequest. 
task=attempt_1478967587833_2622_1_06_31_0, priority=65, 
capability=, hosts=[3n01]
2016-11-18 20:20:01,982 [INFO] [LlapScheduler] 
|tezplugins.LlapTaskSchedulerService|: Assigned task 
TaskInfo{task=attempt_1478967587833_2622_1_06_31_0, priority=65, 
startTime=0, containerId=null, assignedInstance=null, uniqueId=55, 
localityDelayTimeout=9223372036854775807} to container 
container_1_2622_01_56 on node=DynamicServiceInstance [alive=true, 
host=3n01:15001 with resources=, 
shufflePort=15551, servicesAddress=http://3n01:15002, mgmtPort=15004]
2016-11-18 20:20:01,982 [INFO] [LlapScheduler] 
|tezplugins.LlapTaskSchedulerService|: ScheduleResult for Task: 
TaskInfo{task=attempt_1478967587833_2622_1_06_31_0, priority=65, 
startTime=10550817928, containerId=container_1_2622_01_56, 
assignedInstance=DynamicServiceInstance [alive=true, host=3n01:15001 with 
resources=, shufflePort=15551, 
servicesAddress=http://3n01:15002, mgmtPort=15004], uniqueId=55, 
localityDelayTimeout=9223372036854775807} = SCHEDULED
2016-11-18 20:20:03,427 [INFO] [Dispatcher thread {Central}] 
|impl.TaskAttemptImpl|: TaskAttempt: [attempt_1478967587833_2622_1_06_31_0] 
started. Is using containerId: [container_1_2622_01_56] on NM: 
[3n01:15001]
2016-11-18 20:20:03,427 [INFO] [Dispatcher thread {Central}] 
|history.HistoryEventHandler|: 
[HISTORY][DAG:dag_1478967587833_2622_1][Event:TASK_ATTEMPT_STARTED]: 
vertexName=Map 1, taskAttemptId=attempt_1478967587833_2622_1_06_31_0, 
startTime=1479500403427, containerId=container_1_2622_01_56, 
nodeId=3n01:15001
2016-11-18 20:20:03,430 [INFO] [TaskCommunicator # 1] 
|tezplugins.LlapTaskCommunicator|: Successfully launched task: 
attempt_1478967587833_2622_1_06_31_0
2016-11-18 20:20:03,434 [INFO] [IPC Server handler 11 on 43092] 
|impl.TaskImpl|: TaskAttempt:attempt_1478967587833_2622_1_06_31_0 sent 
events: (0-1).
2016-11-18 20:20:03,434 [INFO] [IPC Server handler 11 on 43092] 
|impl.VertexImpl|: Sending attempt_1478967587833_2622_1_06_31_0 24 events 
[0,24) total 24 vertex_1478967587833_2622_1_06 [Map 1]
2016-11-18 20:25:43,249 [INFO] [Dispatcher thread {Central}] 
|history.HistoryEventHandler|: 
[HISTORY][DAG:dag_1478967587833_2622_1][Event:TASK_ATTEMPT_FINISHED]: 
vertexName=Map 1, taskAttemptId=attempt_1478967587833_2622_1_06_31_0, 
creationTime=1479500401929, allocationTime=1479500403426, 
startTime=1479500403427, finishTime=1479500743249, timeTaken=339822, 
status=FAILED, taskFailureType=NON_FATAL, errorEnum=TASK_HEARTBEAT_ERROR, 
diagnostics=AttemptID:attempt_1478967587833_2622_1_06_31_0 Timed out after 
300 secs, nodeHttpAddress=http://3n01:15002, counters=Counters: 1, 
org.apache.tez.common.counters.DAGCounter, DATA_LOCAL_TASKS=1
2016-11-18 20:25:43,255 [INFO] [TaskSchedulerEventHandlerThread] 
|tezplugins.LlapTaskSchedulerService|: Processing de-allocate request for 
task=attempt_1478967587833_2622_1_06_31_0, state=ASSIGNED, endReason=OTHER
2016-11-18 20:25:43,259 [INFO] [Dispatcher thread {Central}] |node.AMNodeImpl|: 
Attempt failed on node: 3n01:15001 TA: attempt_1478967587833_2622_1_06_31_0 
failed: true container: container_1_2622_01_56 numFailedTAs: 7
2016-11-18 20:25:43,262 [INFO] [Dispatcher thread {Central}] |impl.VertexImpl|: 
Source task attempt completed for vertex: vertex_1478967587833_2622_1_07 
[Reducer 2] attempt: attempt_1478967587833_2622_1_06_31_0 with state: 
FAILED vertexState: RUNNING
{noformat}
Second attempt:
{noformat}
2016-11-18 20:25:43,267 [INFO] [TaskSchedulerEventHandlerThread] 
|tezplugins.LlapTaskSchedulerService|: Received allocateRequest. 
task=attempt_1478967587833_2622_1_06_31_1, priority=64, 
capability=, hosts=null
2016-11-18 20:25:43,297 [INFO] [LlapScheduler] 
|tezplugins.LlapTaskSchedulerService|: ScheduleResult for Task: 
TaskInfo{task=attempt_1478967587833_2622_1_06_31_1, priority=64, 
startTime=0, containerId=null, assignedInstance=null, uniqueId=318, 
localityDelayTimeout=9223372036854775807} = DELAYED_RESOURCES
2016-11-18 20:25:43,297 [INFO] [LlapScheduler] 
|tezplugins.LlapTaskSchedulerService|: Preempting for 
task=attempt_1478967587833_2622_1_06_31_1 on any available host
2016-11-18 20:25:43,298 [INFO] [LlapScheduler] 
|tezplugins.LlapTaskSchedulerService|: ScheduleResult for Task: 
TaskInfo{task=attempt_1478967587833_2622_1_06_31_1, priority=64, 
startTime=0, containerId=null, assignedInstance=null, uniqueId=318, 
localityDelayTi