Sahil Takiar created IMPALA-9113: ------------------------------------ Summary: Queries can hang if an impalad is killed after a query has FINISHED Key: IMPALA-9113 URL: https://issues.apache.org/jira/browse/IMPALA-9113 Project: IMPALA Issue Type: Bug Components: Backend, Clients Reporter: Sahil Takiar Assignee: Sahil Takiar
There is a race condition in the query coordination code that could cause queries to hang indefinitely in an un-cancellable state if an impalad crashes after the query has transitioned to the FINISHED state, but before all backends have completed. The issue occurs if: * A query produces all results * A client issues a fetch request to read all of those results * The client fetch request fetches all available rows (e.g. eos is hit) * {{Coordinator::GetNext}} then calls {{SetNonErrorTerminalState(ExecState::RETURNED_RESULTS)}} which eventually calls {{WaitForBackends()}} * {{WaitForBackends()}} will block until all backends have completed * One of the impalads running the query crashes, and thus never reports success for the query fragment it was running * The {{WaitForBackends()}} call will then block indefinitely * Any attempt to cancel the query fails because the original fetch request that drove the {{WaitForBackends()}} call has acquired the {{ClientRequestState}} lock, which thus prevents any cancellation from occurring. Implementing IMPALA-6984 should theoretically fix because as soon as eos is hit, it would call {{CancelBackends()}} rather than {{WaitForBackends()}}. Another solution would be to add a timeout to the {{WaitForBackends()}} so that it returns after the timeout is hit, this would force the fetch request to return 0 rows with {{hasMoreRows=true}}, and unblock any cancellation threads. -- This message was sent by Atlassian Jira (v8.3.4#803005)