[ https://issues.apache.org/jira/browse/HIVE-22687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Himanshu Mishra updated HIVE-22687:
-----------------------------------
    Attachment: HIVE-22687.02.patch
        Status: Patch Available  (was: Open)

Resubmitting the same patch to rerun the tests after an unrelated test failure.

> Query hangs indefinitely if LLAP daemon registers after the query is submitted
> ------------------------------------------------------------------------------
>
>                 Key: HIVE-22687
>                 URL: https://issues.apache.org/jira/browse/HIVE-22687
>             Project: Hive
>          Issue Type: Bug
>          Components: llap
>    Affects Versions: 3.1.0
>            Reporter: Himanshu Mishra
>            Assignee: Himanshu Mishra
>            Priority: Major
>         Attachments: HIVE-22687.01.patch, HIVE-22687.02.patch
>
>
> If a query is submitted while no LLAP daemon is running, it waits for 1 minute
> and then times out with the error {{SERVICE_UNAVAILABLE}}.
> If a new LLAP daemon starts during that wait, the timeout is cancelled, but
> the tasks do not get scheduled either. As a result, the query hangs
> indefinitely.
> This is caused by a race condition: the LLAP daemon first registers the LLAP
> instance at {{.../workers/worker-0000}} and only afterwards registers
> {{.../workers/slot-0000}}. In the gap between the two, the Tez AM is notified
> of the worker ZK node. While processing it, the AM checks whether the slot ZK
> node is present and, if it is not, rejects the LLAP daemon. The error in the
> Tez AM is:
> {code:java}
> [INFO] [LlapScheduler] |impl.LlapZookeeperRegistryImpl|: Unknown slot for 8ebfdc45-0382-4757-9416-52898885af90
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
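
The ordering problem described in the issue can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions: the class SlotRaceSketch, its methods, and the deferUnknownSlot flag are hypothetical names invented for illustration. It is not the code in LlapZookeeperRegistryImpl and not the attached patch. It models a watcher that receives a "worker node added" event before the matching "slot node added" event, once with the reject-on-unknown-slot behaviour the issue reports and once with a variant that parks the worker until its slot appears.

```java
import java.util.HashSet;
import java.util.Set;

/**
 * Hypothetical model of the race in HIVE-22687. All names here are
 * illustrative assumptions and do not mirror Hive's registry classes.
 */
public final class SlotRaceSketch {
    // When false, a worker whose slot is not yet known is dropped for good.
    private final boolean deferUnknownSlot;
    private final Set<String> knownSlots = new HashSet<>();
    private final Set<String> parkedWorkers = new HashSet<>();
    private final Set<String> schedulable = new HashSet<>();

    public SlotRaceSketch(boolean deferUnknownSlot) {
        this.deferUnknownSlot = deferUnknownSlot;
    }

    /** Called when the worker node for a daemon becomes visible. */
    public void onWorkerNode(String daemonId) {
        if (knownSlots.contains(daemonId)) {
            schedulable.add(daemonId);
        } else if (deferUnknownSlot) {
            // Remember the worker and wait for its slot node.
            parkedWorkers.add(daemonId);
        }
        // Otherwise the worker is rejected ("Unknown slot") and nothing
        // in this model looks at it again.
    }

    /** Called when the slot node for a daemon becomes visible. */
    public void onSlotNode(String daemonId) {
        knownSlots.add(daemonId);
        if (parkedWorkers.remove(daemonId)) {
            schedulable.add(daemonId);
        }
    }

    public boolean isSchedulable(String daemonId) {
        return schedulable.contains(daemonId);
    }
}
```

With deferUnknownSlot set to false, delivering onWorkerNode before onSlotNode for the same daemon leaves it unschedulable, which corresponds to the hang described above. With the flag set to true, the same event order ends with the daemon schedulable. Whether the actual patch defers, retries, or changes the registration order is not stated in this message.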