[jira] [Commented] (HIVE-15529) LLAP: TaskSchedulerService can get stuck when scheduleTask returns DELAYED_RESOURCES

Rajesh Balamohan (JIRA) Tue, 03 Jan 2017 16:03:07 -0800

    [ 
https://issues.apache.org/jira/browse/HIVE-15529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15796620#comment-15796620
 ]


Rajesh Balamohan commented on HIVE-15529:
-----------------------------------------


Yes, the issue is with "PathChildrenCache.getCurrentData(name)". The cache 
internally stores "path, childData" entries, but the current code passes 
"getNodeIdentity()" as the key, so the lookup would always return null. Hence 
it was not able to re-enable the disabled node.


{noformat}
For example, PathChildrenCache has the following keys and values

Keys: [/user-rbalamohan/llap0/workers/slot-0000000000, 
/user-rbalamohan/llap0/workers/worker-0000000416]
Values: [ChildData{path='/user-rbalamohan/llap0/workers/slot-0000000000', 
stat=101148,101148,1483486257760,1483486257760,0,0,0,96572092669136166,36,0,101148...

But NodeEnablerCallable ends up requesting the entry by nodeIdentity rather 
than by path, hence the issue. Here is the log snippet:

2017-01-03 18:33:16,343 [INFO] [LlapSchedulerNodeEnabler] 
|tezplugins.LlapTaskSchedulerService|: Attempting to re-enable node: 
{machine-105:40396, id=f5c2afe6-79bb-4636-8357-6b9158bef4d2, stc=24}
..
2017-01-03 18:33:16,356 [INFO] [LlapSchedulerNodeEnabler] 
|tezplugins.LlapTaskSchedulerService|: Not re-enabling node: 
{machine-105:40396, id=f5c2afe6-79bb-4636-8357-6b9158bef4d2, stc=24}, since it 
is not present in the RegistryActiveNodeList
{noformat}

Patch fixes the issue by checking for nodeIdentity.
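
To illustrate the mismatch described above, here is a minimal sketch that 
uses a plain Map as a stand-in for the path-keyed cache. It is not the actual 
patch and does not use the Curator API. The class, method, and field names are 
hypothetical, and the pairing of the worker path with the node id from the log 
is assumed purely for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

public class RegistryLookupSketch {

    // Hypothetical stand-in for a cached registry entry (not Curator's ChildData).
    static final class Entry {
        final String path;
        final String nodeIdentity;

        Entry(String path, String nodeIdentity) {
            this.path = path;
            this.nodeIdentity = nodeIdentity;
        }
    }

    // Builds a cache keyed by path, as in the key/value dump above.
    // Assumption: this worker path belongs to the node id seen in the log.
    static Map<String, Entry> sampleCache() {
        Map<String, Entry> cache = new LinkedHashMap<>();
        String path = "/user-rbalamohan/llap0/workers/worker-0000000416";
        cache.put(path, new Entry(path, "f5c2afe6-79bb-4636-8357-6b9158bef4d2"));
        return cache;
    }

    // Mirrors the failing call: the map is keyed by path, so passing a
    // node identity as the key never matches and the result is null.
    static Entry lookupByPathKey(Map<String, Entry> cache, String key) {
        return cache.get(key);
    }

    // One way to check by identity instead: scan the cached values and
    // compare the identity stored in each entry.
    static Optional<Entry> findByIdentity(Map<String, Entry> cache, String nodeIdentity) {
        for (Entry e : cache.values()) {
            if (nodeIdentity.equals(e.nodeIdentity)) {
                return Optional.of(e);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Map<String, Entry> cache = sampleCache();
        String id = "f5c2afe6-79bb-4636-8357-6b9158bef4d2";
        System.out.println("keyed lookup found entry: " + (lookupByPathKey(cache, id) != null));
        System.out.println("identity scan found entry: " + findByIdentity(cache, id).isPresent());
    }
}
```

The scan is linear in the number of cached children, which is usually fine 
for a small registry. A secondary index keyed by identity would be an 
alternative if that assumption does not hold.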

> LLAP: TaskSchedulerService can get stuck when scheduleTask returns 
> DELAYED_RESOURCES
> ------------------------------------------------------------------------------------
>
>                 Key: HIVE-15529
>                 URL: https://issues.apache.org/jira/browse/HIVE-15529
>             Project: Hive
>          Issue Type: Bug
>          Components: llap
>            Reporter: Rajesh Balamohan
>            Assignee: Rajesh Balamohan
>            Priority: Critical
>         Attachments: HIVE-15529.1.patch
>
>
> Easier way to simulate the issue:
> 1. Start hive cli with "--hiveconf hive.execution.mode=llap"
> 2. Run a sql script file (e.g. a sql script containing tpc-ds queries)
> 3. In the middle of the run, press "ctrl+C", which interrupts the current 
> job. This should not exit the hive cli yet.
> 4. After some time, launch the same SQL script in the same cli. This would get 
> stuck indefinitely (waiting for computing the splits).
> Even when the cli is quit, the AM runs forever until explicitly killed. 
> Issue seems to be around {{LlapTaskSchedulerService::schedulePendingTasks}} 
> dealing with the loop when it encounters {{DELAYED_RESOURCES}} on task 
> scheduling. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
