Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16911 )

Change subject: IMPALA-10413: fix impalad crashes when canceling the retrying 
query
......................................................................


Patch Set 10:

(9 comments)

Thanks for updating the patch. The fix generally makes sense to me. Still have 
some minor questions to discuss.

http://gerrit.cloudera.org:8080/#/c/16911/8/be/src/runtime/query-driver.cc
File be/src/runtime/query-driver.cc:

http://gerrit.cloudera.org:8080/#/c/16911/8/be/src/runtime/query-driver.cc@273
PS8, Line 273: dleRetryFailure(&status, &error_msg, request_st
> Is thread wait_thread_ still running? If yes, what's the waiting point in C
After checking more code paths, I think I understand why we need 
retry_request_state->Finalize() here. But we need to make these comments more 
clear.

      // 'retried_client_request_state_' hasn't been updated at this point, so 
the active
      // ClientRequestState is still the original one. When HandleRetryFailure 
unregisters
      // the retried query, it actually finalizes the original 
ClientRequestState. So we
      // have to explicitly finalize 'retry_request_state', otherwise we'll hit 
some
      // illegal states in destroying it.

The illegal states include wait_thread_ not reset, coordinator still executing, 
etc.


http://gerrit.cloudera.org:8080/#/c/16911/10/be/src/runtime/query-driver.cc
File be/src/runtime/query-driver.cc:

http://gerrit.cloudera.org:8080/#/c/16911/10/be/src/runtime/query-driver.cc@85
PS10, Line 85:   if (client_request_state_->query_id() != query_id) return 
nullptr;
It's not clear to me how we will go into this state. If we encounter errors in 
RetryQueryFromThread, it seems no one will get the new query id. Is it possible 
to add test coverage for this?


http://gerrit.cloudera.org:8080/#/c/16911/10/be/src/runtime/query-driver.cc@270
PS10, Line 270:     if (!status.ok()) {
Could you also add test coverage for this case?


http://gerrit.cloudera.org:8080/#/c/16911/10/be/src/runtime/query-driver.cc@284
PS10, Line 284:   if (!status.ok()) {
Could you also add test coverage for this case?


http://gerrit.cloudera.org:8080/#/c/16911/10/be/src/runtime/query-driver.cc@292
PS10, Line 292: CHECK_QUERY_DRIVER_DELAY
I think we should mention "retry" to avoid confusion. What about 
"RETRY_DELAY_CHECKING_ORIGINAL_DRIVER"?


http://gerrit.cloudera.org:8080/#/c/16911/10/be/src/runtime/query-driver.cc@300
PS10, Line 300: Encountered error checking the original query
nit: "Failed to retry query since the original query was unregistered"


http://gerrit.cloudera.org:8080/#/c/16911/10/tests/custom_cluster/test_query_retries.py
File tests/custom_cluster/test_query_retries.py:

http://gerrit.cloudera.org:8080/#/c/16911/10/tests/custom_cluster/test_query_retries.py@721
PS10, Line 721:
Could you add codes to validate the original query is not marked as retried? We 
can get the original query id by handle.get_handle().id, and then get the 
original query profile by this id. '__validate_runtime_profiles' has some 
example codes.


http://gerrit.cloudera.org:8080/#/c/16911/10/tests/custom_cluster/test_query_retries.py@722
PS10, Line 722:     # Close the query.
Do we really need to close a cancelled query?


http://gerrit.cloudera.org:8080/#/c/16911/10/tests/custom_cluster/test_query_retries.py@723
PS10, Line 723:     self.client.close_query(handle)
We should check whether coordinator still lives. The cancel and close 
operations return immediately, but the crash may happen laster. So adding the 
following codes help to actuallly capture the bug:

    time.sleep(2)
    assert self.cluster.impalads[0].get_pid() is not None, "Coordinator crashed"

Without these, the test still pass even the coordinator crashes.



--
To view, visit http://gerrit.cloudera.org:8080/16911
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4fd7228acd0a70d33859029052239f9b9f795e5d
Gerrit-Change-Number: 16911
Gerrit-PatchSet: 10
Gerrit-Owner: Xianqing He <hexianqing...@126.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <huangquanl...@gmail.com>
Gerrit-Reviewer: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Reviewer: Xianqing He <hexianqing...@126.com>
Gerrit-Comment-Date: Mon, 11 Jan 2021 09:59:29 +0000
Gerrit-HasComments: Yes

Reply via email to