[jira] [Comment Edited] (IMPALA-8411) A race during query cancellation by a client can make the client see it as still running

Tim Armstrong (JIRA) Mon, 15 Apr 2019 10:18:12 -0700


    [ 
https://issues.apache.org/jira/browse/IMPALA-8411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16818194#comment-16818194
 ]


Tim Armstrong edited comment on IMPALA-8411 at 4/15/19 5:17 PM:
----------------------------------------------------------------

I think there's two related issues here - there is the eos behaviour tracked by 
IMPALA-7561 and 
a new behaviour that we hadn't noticed where cancellation is asynchronous when 
cause == NULL and synchronous when cause != NULL.

I think what [~lv] is suggesting is defining the invariant that the Cancel() 
RPC *always* transitions the query into a EXCEPTION or CANCELLED state before 
it returns. (We don't currently use the CANCELLED state, but we intend to in 
future -  IMPALA-1262)

I'm not sure why the transition is asynchronous. It does actually seem like a 
bug that's masked because the query transitions into EXCEPTION via another 
mechanism. I may be missing something - maybe [~bikram.sngh91] has an idea?

Here's the code that causes the behaviour in ClientRequestState::Cancel()
{code}
    // If the query has reached a terminal state, no need to update the state.
    bool already_done = eos_ || operation_state_ == 
TOperationState::ERROR_STATE;
    if (!already_done && cause != NULL) {
      DCHECK(!cause->ok());
      discard_result(UpdateQueryStatus(*cause));
      query_events_->MarkEvent("Cancelled");
      DCHECK_EQ(operation_state_, TOperationState::ERROR_STATE);
    }
{code}

The method comment needs updating too:
{code}
  /// Cancels the child queries and the coordinator with the given cause.
  /// If cause is NULL, it assume this was deliberately cancelled by the user 
while in
  /// FINISHED operation state. Otherwise, sets state to ERROR_STATE (TODO: 
IMPALA-1262:
  /// use CANCELED_STATE). Does nothing if the query has reached EOS or already 
cancelled.
  ///
  /// Only returns an error if 'check_inflight' is true and the query is not yet
  /// in-flight. Otherwise, proceed and return Status::OK() even if the query 
isn't
  /// in-flight (for cleaning up after an error on the query issuing path).
  Status Cancel(bool check_inflight, const Status* cause) WARN_UNUSED_RESULT;
{code}



was (Author: tarmstrong):
I think there's two related issues here - there is the eos behaviour tracked by 
IMPALA-7561 and 

I think what [~lv] is suggesting is defining the invariant that the Cancel() 
RPC *always* transitions the query into a EXCEPTION or CANCELLED state before 
it returns. (We don't currently use the CANCELLED state, but we intend to in 
future -  IMPALA-1262)

I'm not sure why the transition is asynchronous. It does actually seem like a 
bug that's masked because the query transitions into EXCEPTION via another 
mechanism. I may be missing something - maybe [~bikram.sngh91] has an idea?

Here's the code that causes the behaviour in ClientRequestState::Cancel()
{code}
    // If the query has reached a terminal state, no need to update the state.
    bool already_done = eos_ || operation_state_ == 
TOperationState::ERROR_STATE;
    if (!already_done && cause != NULL) {
      DCHECK(!cause->ok());
      discard_result(UpdateQueryStatus(*cause));
      query_events_->MarkEvent("Cancelled");
      DCHECK_EQ(operation_state_, TOperationState::ERROR_STATE);
    }
{code}

The method comment needs updating too:
{code}
  /// Cancels the child queries and the coordinator with the given cause.
  /// If cause is NULL, it assume this was deliberately cancelled by the user 
while in
  /// FINISHED operation state. Otherwise, sets state to ERROR_STATE (TODO: 
IMPALA-1262:
  /// use CANCELED_STATE). Does nothing if the query has reached EOS or already 
cancelled.
  ///
  /// Only returns an error if 'check_inflight' is true and the query is not yet
  /// in-flight. Otherwise, proceed and return Status::OK() even if the query 
isn't
  /// in-flight (for cleaning up after an error on the query issuing path).
  Status Cancel(bool check_inflight, const Status* cause) WARN_UNUSED_RESULT;
{code}


> A race during query cancellation by a client can make the client see it as 
> still running
> ----------------------------------------------------------------------------------------
>
>                 Key: IMPALA-8411
>                 URL: https://issues.apache.org/jira/browse/IMPALA-8411
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>    Affects Versions: Impala 3.3.0
>            Reporter: Lars Volker
>            Priority: Major
>              Labels: query-lifecycle
>
> I observed that under load the check in 
> https://gerrit.cloudera.org/#/c/12530/8/tests/query_test/test_cancellation.py@229
>  would sometimes fail. After some poking it appears that the client's 
> cancellation request does not update the operation state (because "cause", 
> the third parameter to Cancelnternal, defaults to nullptr). Then the client 
> disconnects and the connection close triggers another call to CancelInternal, 
> this time with cause = "Session closed". This is also what shows up in the 
> profile. If the server is under load, handling the closing connection can 
> take long enough so that a concurrent request for the state still returns 
> RUNNING. The aim of this fix was to make CancelOperation synchronous, i.e. 
> when it returns the subsequent call to GetOperationStatus would not return 
> RUNNING.
> The comment on ClientRequestState::Cancel() suggests that this is omitted 
> when handling client cancellation because we expect that to happen regularly 
> for finished queries. In that case however, eos_ would be set and we wouldn't 
> update the state either, no?
> /// Cancels the child queries and the coordinator with the given cause.
> /// If cause is NULL, it assume this was deliberately cancelled by the user 
> while in
> /// FINISHED operation state. Otherwise, sets state to ERROR_STATE (TODO: 
> IMPALA-1262:
> /// use CANCELED_STATE). Does nothing if the query has reached EOS or already 
> cancelled.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org

[jira] [Comment Edited] (IMPALA-8411) A race during query cancellation by a client can make the client see it as still running

Reply via email to