[ https://issues.apache.org/jira/browse/IMPALA-8411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16818194#comment-16818194 ]
Tim Armstrong edited comment on IMPALA-8411 at 4/15/19 5:17 PM: ---------------------------------------------------------------- I think there's two related issues here - there is the eos behaviour tracked by IMPALA-7561 and a new behaviour that we hadn't noticed where cancellation is asynchronous when cause == NULL and synchronous when cause != NULL. I think what [~lv] is suggesting is defining the invariant that the Cancel() RPC *always* transitions the query into a EXCEPTION or CANCELLED state before it returns. (We don't currently use the CANCELLED state, but we intend to in future - IMPALA-1262) I'm not sure why the transition is asynchronous. It does actually seem like a bug that's masked because the query transitions into EXCEPTION via another mechanism. I may be missing something - maybe [~bikram.sngh91] has an idea? Here's the code that causes the behaviour in ClientRequestState::Cancel() {code} // If the query has reached a terminal state, no need to update the state. bool already_done = eos_ || operation_state_ == TOperationState::ERROR_STATE; if (!already_done && cause != NULL) { DCHECK(!cause->ok()); discard_result(UpdateQueryStatus(*cause)); query_events_->MarkEvent("Cancelled"); DCHECK_EQ(operation_state_, TOperationState::ERROR_STATE); } {code} The method comment needs updating too: {code} /// Cancels the child queries and the coordinator with the given cause. /// If cause is NULL, it assume this was deliberately cancelled by the user while in /// FINISHED operation state. Otherwise, sets state to ERROR_STATE (TODO: IMPALA-1262: /// use CANCELED_STATE). Does nothing if the query has reached EOS or already cancelled. /// /// Only returns an error if 'check_inflight' is true and the query is not yet /// in-flight. Otherwise, proceed and return Status::OK() even if the query isn't /// in-flight (for cleaning up after an error on the query issuing path). Status Cancel(bool check_inflight, const Status* cause) WARN_UNUSED_RESULT; {code} was (Author: tarmstrong): I think there's two related issues here - there is the eos behaviour tracked by IMPALA-7561 and I think what [~lv] is suggesting is defining the invariant that the Cancel() RPC *always* transitions the query into a EXCEPTION or CANCELLED state before it returns. (We don't currently use the CANCELLED state, but we intend to in future - IMPALA-1262) I'm not sure why the transition is asynchronous. It does actually seem like a bug that's masked because the query transitions into EXCEPTION via another mechanism. I may be missing something - maybe [~bikram.sngh91] has an idea? Here's the code that causes the behaviour in ClientRequestState::Cancel() {code} // If the query has reached a terminal state, no need to update the state. bool already_done = eos_ || operation_state_ == TOperationState::ERROR_STATE; if (!already_done && cause != NULL) { DCHECK(!cause->ok()); discard_result(UpdateQueryStatus(*cause)); query_events_->MarkEvent("Cancelled"); DCHECK_EQ(operation_state_, TOperationState::ERROR_STATE); } {code} The method comment needs updating too: {code} /// Cancels the child queries and the coordinator with the given cause. /// If cause is NULL, it assume this was deliberately cancelled by the user while in /// FINISHED operation state. Otherwise, sets state to ERROR_STATE (TODO: IMPALA-1262: /// use CANCELED_STATE). Does nothing if the query has reached EOS or already cancelled. /// /// Only returns an error if 'check_inflight' is true and the query is not yet /// in-flight. Otherwise, proceed and return Status::OK() even if the query isn't /// in-flight (for cleaning up after an error on the query issuing path). Status Cancel(bool check_inflight, const Status* cause) WARN_UNUSED_RESULT; {code} > A race during query cancellation by a client can make the client see it as > still running > ---------------------------------------------------------------------------------------- > > Key: IMPALA-8411 > URL: https://issues.apache.org/jira/browse/IMPALA-8411 > Project: IMPALA > Issue Type: Bug > Components: Backend > Affects Versions: Impala 3.3.0 > Reporter: Lars Volker > Priority: Major > Labels: query-lifecycle > > I observed that under load the check in > https://gerrit.cloudera.org/#/c/12530/8/tests/query_test/test_cancellation.py@229 > would sometimes fail. After some poking it appears that the client's > cancellation request does not update the operation state (because "cause", > the third parameter to Cancelnternal, defaults to nullptr). Then the client > disconnects and the connection close triggers another call to CancelInternal, > this time with cause = "Session closed". This is also what shows up in the > profile. If the server is under load, handling the closing connection can > take long enough so that a concurrent request for the state still returns > RUNNING. The aim of this fix was to make CancelOperation synchronous, i.e. > when it returns the subsequent call to GetOperationStatus would not return > RUNNING. > The comment on ClientRequestState::Cancel() suggests that this is omitted > when handling client cancellation because we expect that to happen regularly > for finished queries. In that case however, eos_ would be set and we wouldn't > update the state either, no? > /// Cancels the child queries and the coordinator with the given cause. > /// If cause is NULL, it assume this was deliberately cancelled by the user > while in > /// FINISHED operation state. Otherwise, sets state to ERROR_STATE (TODO: > IMPALA-1262: > /// use CANCELED_STATE). Does nothing if the query has reached EOS or already > cancelled. -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org