[ https://issues.apache.org/jira/browse/HIVE-15529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15796620#comment-15796620 ]
Rajesh Balamohan commented on HIVE-15529:
-----------------------------------------

Yes, the issue is with "PathChildrenCache.getCurrentData(name)". The cache internally stores "path, childData" pairs, but the code currently passes the value of "getNodeIdentity()" as the lookup key, so the lookup always returns null. Hence it was not able to re-enable the disabled node.

{noformat}
For example, PathChildrenCache has the following keys and values
Keys: [/user-rbalamohan/llap0/workers/slot-0000000000, /user-rbalamohan/llap0/workers/worker-0000000416]
Values: [ChildData{path='/user-rbalamohan/llap0/workers/slot-0000000000', stat=101148,101148,1483486257760,1483486257760,0,0,0,96572092669136166,36,0,101148...

But as per NodeEnablerCallable, it ends up requesting based on nodeIdentity. Hence the issue.

Here is the log snippet
2017-01-03 18:33:16,343 [INFO] [LlapSchedulerNodeEnabler] |tezplugins.LlapTaskSchedulerService|: Attempting to re-enable node: {machine-105:40396, id=f5c2afe6-79bb-4636-8357-6b9158bef4d2, stc=24}
..
2017-01-03 18:33:16,356 [INFO] [LlapSchedulerNodeEnabler] |tezplugins.LlapTaskSchedulerService|: Not re-enabling node: {machine-105:40396, id=f5c2afe6-79bb-4636-8357-6b9158bef4d2, stc=24}, since it is not present in the RegistryActiveNodeList
{noformat}

The patch fixes the issue by checking for nodeIdentity.

> LLAP: TaskSchedulerService can get stuck when scheduleTask returns DELAYED_RESOURCES
> ------------------------------------------------------------------------------------
>
>                 Key: HIVE-15529
>                 URL: https://issues.apache.org/jira/browse/HIVE-15529
>             Project: Hive
>          Issue Type: Bug
>          Components: llap
>            Reporter: Rajesh Balamohan
>            Assignee: Rajesh Balamohan
>            Priority: Critical
>         Attachments: HIVE-15529.1.patch
>
>
> An easier way to simulate the issue:
> 1. Start hive cli with "--hiveconf hive.execution.mode=llap"
> 2. Run a SQL script file (e.g. a SQL script containing TPC-DS queries)
> 3. In the middle of the run, press "ctrl+C", which would interrupt the current
> job. This should not exit the hive cli yet.
> 4. After some time, launch the same SQL script in the same cli. This would get
> stuck indefinitely (waiting for computing the splits).
> Even when the cli is quit, the AM runs forever until explicitly killed.
> The issue seems to be around {{LlapTaskSchedulerService::schedulePendingTasks}}
> dealing with the loop when it encounters {{DELAYED_RESOURCES}} on task
> scheduling.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
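
For illustration only, the key mismatch described in the comment can be reproduced with a plain map. This is a minimal sketch, not the Hive code and not the attached patch: a `LinkedHashMap` stands in for PathChildrenCache, the class and method names (`RegistryLookupSketch`, `lookupByKey`, `findPathByIdentity`) are hypothetical, and the pairing of the worker path with the node id from the log is assumed for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

public class RegistryLookupSketch {
    // Stand-in for the cache: keyed by znode path, like the keys listed in the comment.
    // The value here is only a node identity; the real cache stores ChildData.
    static final Map<String, String> cache = new LinkedHashMap<>();

    static {
        // Assumption: this path-to-identity pairing is made up for illustration.
        cache.put("/user-rbalamohan/llap0/workers/worker-0000000416",
                  "f5c2afe6-79bb-4636-8357-6b9158bef4d2");
    }

    // Mirrors the failing pattern: a path-keyed map queried with a node identity.
    static String lookupByKey(String name) {
        return cache.get(name);
    }

    // Hypothetical alternative: match on the identity held in the value instead.
    static Optional<String> findPathByIdentity(String nodeIdentity) {
        return cache.entrySet().stream()
                .filter(e -> e.getValue().equals(nodeIdentity))
                .map(Map.Entry::getKey)
                .findFirst();
    }

    public static void main(String[] args) {
        String id = "f5c2afe6-79bb-4636-8357-6b9158bef4d2";
        System.out.println(lookupByKey(id));                    // null: the id is not a path
        System.out.println(findPathByIdentity(id).isPresent()); // true
    }
}
```

Looking up by identity scans the values, which is linear in the number of registered nodes; that is usually acceptable for a small registry, but whether the actual patch does it this way is not stated here.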