Sergey Shelukhin created HIVE-15255:
---------------------------------------

             Summary: LLAP: service_busy error should not be retried so fast
                 Key: HIVE-15255
                 URL: https://issues.apache.org/jira/browse/HIVE-15255
             Project: Hive
          Issue Type: Bug
            Reporter: Sergey Shelukhin


{noformat}
2016-11-18 20:28:20,605 FINISHED]: vertexName=Map 1, taskAttemptId=attempt_1478967587833_2622_1_06_000105_1328, timeTaken=5, status=KILLED, errorEnum=SERVICE_BUSY, diagnostics=Service Busy, nodeHttpAddress=(node3), counters=Counters: 1, org.apache.tez.common.counters.DAGCounter, DATA_LOCAL_TASKS=1
2016-11-18 20:28:20,612 STARTED]: vertexName=Map 1, taskAttemptId=attempt_1478967587833_2622_1_06_000105_1329, containerId=container_222212222_2622_01_012504, nodeId=(node3):15001
2016-11-18 20:28:20,628 FINISHED]: vertexName=Map 1, taskAttemptId=attempt_1478967587833_2622_1_06_000105_1329, timeTaken=16, status=KILLED, errorEnum=SERVICE_BUSY, diagnostics=Service Busy, nodeHttpAddress=(node3), counters=Counters: 1, org.apache.tez.common.counters.DAGCounter, DATA_LOCAL_TASKS=1
2016-11-18 20:28:20,634 STARTED]: vertexName=Map 1, taskAttemptId=attempt_1478967587833_2622_1_06_000105_1330, containerId=container_222212222_2622_01_012511, nodeId=(node3):15001
2016-11-18 20:28:20,751 FINISHED]: vertexName=Map 1, taskAttemptId=attempt_1478967587833_2622_1_06_000105_1330, timeTaken=117, status=KILLED, errorEnum=SERVICE_BUSY, diagnostics=Service Busy, nodeHttpAddress=(node3), counters=Counters: 1, org.apache.tez.common.counters.DAGCounter, DATA_LOCAL_TASKS=1
2016-11-18 20:28:20,757 STARTED]: vertexName=Map 1, taskAttemptId=attempt_1478967587833_2622_1_06_000105_1331, containerId=container_222212222_2622_01_012522, nodeId=(node3):15001
2016-11-18 20:28:20,771 FINISHED]: vertexName=Map 1, taskAttemptId=attempt_1478967587833_2622_1_06_000105_1331, timeTaken=14, status=KILLED, errorEnum=SERVICE_BUSY, diagnostics=Service Busy, nodeHttpAddress=(node3), counters=Counters: 1, org.apache.tez.common.counters.DAGCounter, DATA_LOCAL_TASKS=1
2016-11-18 20:28:20,777 STARTED]: vertexName=Map 1, taskAttemptId=attempt_1478967587833_2622_1_06_000105_1332, containerId=container_222212222_2622_01_012529, nodeId=(node3):15001
2016-11-18 20:28:20,783 FINISHED]: vertexName=Map 1, taskAttemptId=attempt_1478967587833_2622_1_06_000105_1332, timeTaken=6, status=KILLED, errorEnum=SERVICE_BUSY, diagnostics=Service Busy, nodeHttpAddress=(node3), counters=Counters: 1, org.apache.tez.common.counters.DAGCounter, DATA_LOCAL_TASKS=1
{noformat}

As you can see from the attempt number, this has been going on for a while. In fact, I think other tasks could have been scheduled in that time (not sure), but the thread just kept retrying this one task until it was finally scheduled.
There should be some fallback after the initial failures; we should also make sure such retries do not take over all scheduling.
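
To make the request concrete, here is a rough sketch of one possible shape for it: a capped exponential delay per task after each SERVICE_BUSY rejection, plus a picker that skips tasks still waiting so that other tasks can be scheduled in the meantime. It is written in Python for brevity and is not Hive or Tez code; `BusyBackoff`, `pick_next`, and the 50 ms / 5000 ms values are all hypothetical names and numbers chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class BusyBackoff:
    """Per-task retry delays after SERVICE_BUSY rejections (hypothetical helper)."""
    base_ms: int = 50      # assumed first delay; not an actual Hive/Tez default
    max_ms: int = 5000     # assumed upper bound on the delay
    _failures: Dict[str, int] = field(default_factory=dict, init=False)
    _not_before_ms: Dict[str, int] = field(default_factory=dict, init=False)

    def record_busy(self, task_id: str, now_ms: int) -> int:
        """Register one rejection; return the delay (ms) before the next retry."""
        n = self._failures.get(task_id, 0) + 1
        self._failures[task_id] = n
        # Double the delay on each consecutive rejection, up to the cap.
        delay = min(self.base_ms * 2 ** (n - 1), self.max_ms)
        self._not_before_ms[task_id] = now_ms + delay
        return delay

    def record_success(self, task_id: str) -> None:
        """Forget the task's failure history once it has been accepted."""
        self._failures.pop(task_id, None)
        self._not_before_ms.pop(task_id, None)

    def is_eligible(self, task_id: str, now_ms: int) -> bool:
        """True if the task is not currently waiting out a delay."""
        return now_ms >= self._not_before_ms.get(task_id, 0)


def pick_next(pending: List[str], backoff: BusyBackoff, now_ms: int) -> Optional[str]:
    """Return the first pending task that is not backing off, or None.

    Skipping delayed tasks keeps one rejected task from monopolizing the
    scheduling loop.
    """
    for task_id in pending:
        if backoff.is_eligible(task_id, now_ms):
            return task_id
    return None
```

With these assumed values, three consecutive rejections of the same task would be followed by waits of 50, 100, and 200 ms, and during each wait `pick_next` would hand out a different pending task instead of retrying the rejected one immediately.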