[jira] [Updated] (HIVE-15255) LLAP: service_busy error should not be retried so fast

Sergey Shelukhin (JIRA) Mon, 21 Nov 2016 17:00:07 -0800

     [ 
https://issues.apache.org/jira/browse/HIVE-15255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Sergey Shelukhin updated HIVE-15255:
------------------------------------
    Description: 
{noformat}
2016-11-18 20:28:20,605 FINISHED]: vertexName=Map 1, 
taskAttemptId=attempt_1478967587833_2622_1_06_000105_1328, timeTaken=5, 
status=KILLED, errorEnum=SERVICE_BUSY, diagnostics=Service Busy, 
nodeHttpAddress=(node3), counters=Counters: 1, 
org.apache.tez.common.counters.DAGCounter, DATA_LOCAL_TASKS=1
2016-11-18 20:28:20,612 STARTED]: vertexName=Map 1, 
taskAttemptId=attempt_1478967587833_2622_1_06_000105_1329, 
containerId=container_222212222_2622_01_012504, nodeId=(node3):15001
2016-11-18 20:28:20,628 FINISHED]: vertexName=Map 1, 
taskAttemptId=attempt_1478967587833_2622_1_06_000105_1329, timeTaken=16, 
status=KILLED, errorEnum=SERVICE_BUSY, diagnostics=Service Busy, 
nodeHttpAddress=(node3), counters=Counters: 1, 
org.apache.tez.common.counters.DAGCounter, DATA_LOCAL_TASKS=1
2016-11-18 20:28:20,634 STARTED]: vertexName=Map 1, 
taskAttemptId=attempt_1478967587833_2622_1_06_000105_1330, 
containerId=container_222212222_2622_01_012511, nodeId=(node3):15001
2016-11-18 20:28:20,751 FINISHED]: vertexName=Map 1, 
taskAttemptId=attempt_1478967587833_2622_1_06_000105_1330, timeTaken=117, 
status=KILLED, errorEnum=SERVICE_BUSY, diagnostics=Service Busy, 
nodeHttpAddress=(node3), counters=Counters: 1, 
org.apache.tez.common.counters.DAGCounter, DATA_LOCAL_TASKS=1
2016-11-18 20:28:20,757 STARTED]: vertexName=Map 1, 
taskAttemptId=attempt_1478967587833_2622_1_06_000105_1331, 
containerId=container_222212222_2622_01_012522, nodeId=(node3):15001
2016-11-18 20:28:20,771 FINISHED]: vertexName=Map 1, 
taskAttemptId=attempt_1478967587833_2622_1_06_000105_1331, timeTaken=14, 
status=KILLED, errorEnum=SERVICE_BUSY, diagnostics=Service Busy, 
nodeHttpAddress=(node3), counters=Counters: 1, 
org.apache.tez.common.counters.DAGCounter, DATA_LOCAL_TASKS=1
2016-11-18 20:28:20,777 STARTED]: vertexName=Map 1, 
taskAttemptId=attempt_1478967587833_2622_1_06_000105_1332, 
containerId=container_222212222_2622_01_012529, nodeId=(node3):15001
2016-11-18 20:28:20,783 FINISHED]: vertexName=Map 1, 
taskAttemptId=attempt_1478967587833_2622_1_06_000105_1332, timeTaken=6, 
status=KILLED, errorEnum=SERVICE_BUSY, diagnostics=Service Busy, 
nodeHttpAddress=(node3), counters=Counters: 1, 
org.apache.tez.common.counters.DAGCounter, DATA_LOCAL_TASKS=1
{noformat}

As the attempt numbers show, this has been going on for a while. In 
fact, I think other tasks could have been scheduled in that time (not sure), but 
the thread just kept retrying this one task until it was finally scheduled.
There should be some fallback after initial failures; we should also make sure 
such retries do not take over all scheduling (not sure if they do, need to 
check).

LLAP on the node was alive, just busy with other tasks. The task did eventually 
get scheduled.
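
In the log above, each new attempt starts about 6-7 ms after the previous one is killed with SERVICE_BUSY. A minimal sketch of one way to slow this down is a capped exponential backoff with jitter, keyed on the number of consecutive SERVICE_BUSY rejections. The class BusyRetryBackoff, its methods, and the 10 ms / 1000 ms values are hypothetical and chosen for illustration only; none of this is existing Hive or Tez code, and it is not necessarily how the issue should be fixed.

```java
import java.util.concurrent.ThreadLocalRandom;

/**
 * Hypothetical helper (not part of Hive or Tez): computes how long to wait
 * before resubmitting a task that was rejected with SERVICE_BUSY.
 */
final class BusyRetryBackoff {
    private final long baseMillis;
    private final long maxMillis;

    BusyRetryBackoff(long baseMillis, long maxMillis) {
        if (baseMillis <= 0 || maxMillis < baseMillis) {
            throw new IllegalArgumentException("need 0 < baseMillis <= maxMillis");
        }
        this.baseMillis = baseMillis;
        this.maxMillis = maxMillis;
    }

    /**
     * Upper bound on the delay after the given number of consecutive
     * SERVICE_BUSY rejections: base, 2*base, 4*base, ... capped at maxMillis.
     */
    long ceilingMillis(int consecutiveBusy) {
        if (consecutiveBusy <= 0) {
            return 0L;
        }
        long delay = baseMillis;
        for (int i = 1; i < consecutiveBusy && delay < maxMillis; i++) {
            // Double without overflowing, then apply the cap.
            delay = (delay > maxMillis / 2) ? maxMillis : delay * 2;
        }
        return delay;
    }

    /**
     * Actual delay to use: a random value in [ceiling/2, ceiling], so that
     * many rejected tasks do not all retry at the same instant.
     */
    long nextDelayMillis(int consecutiveBusy) {
        long ceiling = ceilingMillis(consecutiveBusy);
        return ThreadLocalRandom.current().nextLong(ceiling / 2, ceiling + 1);
    }
}
```

With new BusyRetryBackoff(10, 1000), the ceilings for rejections 1, 2, 3 are 10, 20, 40 ms, and the delay stops growing at 1000 ms. A scheduler could put the rejected task back in its queue with this delay instead of retrying it in a loop, so other tasks can still be scheduled in the meantime.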

  was:
{noformat}
2016-11-18 20:28:20,605 FINISHED]: vertexName=Map 1, 
taskAttemptId=attempt_1478967587833_2622_1_06_000105_1328, timeTaken=5, 
status=KILLED, errorEnum=SERVICE_BUSY, diagnostics=Service Busy, 
nodeHttpAddress=(node3), counters=Counters: 1, 
org.apache.tez.common.counters.DAGCounter, DATA_LOCAL_TASKS=1
2016-11-18 20:28:20,612 STARTED]: vertexName=Map 1, 
taskAttemptId=attempt_1478967587833_2622_1_06_000105_1329, 
containerId=container_222212222_2622_01_012504, nodeId=(node3):15001
2016-11-18 20:28:20,628 FINISHED]: vertexName=Map 1, 
taskAttemptId=attempt_1478967587833_2622_1_06_000105_1329, timeTaken=16, 
status=KILLED, errorEnum=SERVICE_BUSY, diagnostics=Service Busy, 
nodeHttpAddress=(node3), counters=Counters: 1, 
org.apache.tez.common.counters.DAGCounter, DATA_LOCAL_TASKS=1
2016-11-18 20:28:20,634 STARTED]: vertexName=Map 1, 
taskAttemptId=attempt_1478967587833_2622_1_06_000105_1330, 
containerId=container_222212222_2622_01_012511, nodeId=(node3):15001
2016-11-18 20:28:20,751 FINISHED]: vertexName=Map 1, 
taskAttemptId=attempt_1478967587833_2622_1_06_000105_1330, timeTaken=117, 
status=KILLED, errorEnum=SERVICE_BUSY, diagnostics=Service Busy, 
nodeHttpAddress=(node3), counters=Counters: 1, 
org.apache.tez.common.counters.DAGCounter, DATA_LOCAL_TASKS=1
2016-11-18 20:28:20,757 STARTED]: vertexName=Map 1, 
taskAttemptId=attempt_1478967587833_2622_1_06_000105_1331, 
containerId=container_222212222_2622_01_012522, nodeId=(node3):15001
2016-11-18 20:28:20,771 FINISHED]: vertexName=Map 1, 
taskAttemptId=attempt_1478967587833_2622_1_06_000105_1331, timeTaken=14, 
status=KILLED, errorEnum=SERVICE_BUSY, diagnostics=Service Busy, 
nodeHttpAddress=(node3), counters=Counters: 1, 
org.apache.tez.common.counters.DAGCounter, DATA_LOCAL_TASKS=1
2016-11-18 20:28:20,777 STARTED]: vertexName=Map 1, 
taskAttemptId=attempt_1478967587833_2622_1_06_000105_1332, 
containerId=container_222212222_2622_01_012529, nodeId=(node3):15001
2016-11-18 20:28:20,783 FINISHED]: vertexName=Map 1, 
taskAttemptId=attempt_1478967587833_2622_1_06_000105_1332, timeTaken=6, 
status=KILLED, errorEnum=SERVICE_BUSY, diagnostics=Service Busy, 
nodeHttpAddress=(node3), counters=Counters: 1, 
org.apache.tez.common.counters.DAGCounter, DATA_LOCAL_TASKS=1
{noformat}

As you can see by the attempt number, this has been going on for a while. In 
fact I think other tasks could have been scheduled in the time (not sure), but 
the thread just kept at it for this one task until it was finally scheduled.
There should be some fallback after initial failures; we should also make sure 
such retries do not take over all scheduling (not sure if they do, need to 
check).


> LLAP: service_busy error should not be retried so fast
> ------------------------------------------------------
>
>                 Key: HIVE-15255
>                 URL: https://issues.apache.org/jira/browse/HIVE-15255
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>
> {noformat}
> 2016-11-18 20:28:20,605 FINISHED]: vertexName=Map 1, 
> taskAttemptId=attempt_1478967587833_2622_1_06_000105_1328, timeTaken=5, 
> status=KILLED, errorEnum=SERVICE_BUSY, diagnostics=Service Busy, 
> nodeHttpAddress=(node3), counters=Counters: 1, 
> org.apache.tez.common.counters.DAGCounter, DATA_LOCAL_TASKS=1
> 2016-11-18 20:28:20,612 STARTED]: vertexName=Map 1, 
> taskAttemptId=attempt_1478967587833_2622_1_06_000105_1329, 
> containerId=container_222212222_2622_01_012504, nodeId=(node3):15001
> 2016-11-18 20:28:20,628 FINISHED]: vertexName=Map 1, 
> taskAttemptId=attempt_1478967587833_2622_1_06_000105_1329, timeTaken=16, 
> status=KILLED, errorEnum=SERVICE_BUSY, diagnostics=Service Busy, 
> nodeHttpAddress=(node3), counters=Counters: 1, 
> org.apache.tez.common.counters.DAGCounter, DATA_LOCAL_TASKS=1
> 2016-11-18 20:28:20,634 STARTED]: vertexName=Map 1, 
> taskAttemptId=attempt_1478967587833_2622_1_06_000105_1330, 
> containerId=container_222212222_2622_01_012511, nodeId=(node3):15001
> 2016-11-18 20:28:20,751 FINISHED]: vertexName=Map 1, 
> taskAttemptId=attempt_1478967587833_2622_1_06_000105_1330, timeTaken=117, 
> status=KILLED, errorEnum=SERVICE_BUSY, diagnostics=Service Busy, 
> nodeHttpAddress=(node3), counters=Counters: 1, 
> org.apache.tez.common.counters.DAGCounter, DATA_LOCAL_TASKS=1
> 2016-11-18 20:28:20,757 STARTED]: vertexName=Map 1, 
> taskAttemptId=attempt_1478967587833_2622_1_06_000105_1331, 
> containerId=container_222212222_2622_01_012522, nodeId=(node3):15001
> 2016-11-18 20:28:20,771 FINISHED]: vertexName=Map 1, 
> taskAttemptId=attempt_1478967587833_2622_1_06_000105_1331, timeTaken=14, 
> status=KILLED, errorEnum=SERVICE_BUSY, diagnostics=Service Busy, 
> nodeHttpAddress=(node3), counters=Counters: 1, 
> org.apache.tez.common.counters.DAGCounter, DATA_LOCAL_TASKS=1
> 2016-11-18 20:28:20,777 STARTED]: vertexName=Map 1, 
> taskAttemptId=attempt_1478967587833_2622_1_06_000105_1332, 
> containerId=container_222212222_2622_01_012529, nodeId=(node3):15001
> 2016-11-18 20:28:20,783 FINISHED]: vertexName=Map 1, 
> taskAttemptId=attempt_1478967587833_2622_1_06_000105_1332, timeTaken=6, 
> status=KILLED, errorEnum=SERVICE_BUSY, diagnostics=Service Busy, 
> nodeHttpAddress=(node3), counters=Counters: 1, 
> org.apache.tez.common.counters.DAGCounter, DATA_LOCAL_TASKS=1
> {noformat}
> As the attempt numbers show, this has been going on for a while. In 
> fact, I think other tasks could have been scheduled in that time (not sure), 
> but the thread just kept retrying this one task until it was finally 
> scheduled.
> There should be some fallback after initial failures; we should also make 
> sure such retries do not take over all scheduling (not sure if they do, need 
> to check).
> LLAP on the node was alive, just busy with other tasks. The task did 
> eventually get scheduled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
