Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/16911 )
Change subject: IMPALA-10413: fix impalad crashes when canceling the retrying query ...................................................................... Patch Set 10: (9 comments) Thanks for updating the patch. The fix generally makes sense to me. Still have some minor questions to discuss. http://gerrit.cloudera.org:8080/#/c/16911/8/be/src/runtime/query-driver.cc File be/src/runtime/query-driver.cc: http://gerrit.cloudera.org:8080/#/c/16911/8/be/src/runtime/query-driver.cc@273 PS8, Line 273: dleRetryFailure(&status, &error_msg, request_st > Is thread wait_thread_ still running? If yes, what's the waiting point in C After checking more code paths, I think I understand why we need retry_request_state->Finalize() here. But we need to make these comments more clear. // 'retried_client_request_state_' hasn't been updated at this point, so the active // ClientRequestState is still the original one. When HandleRetryFailure unregisters // the retried query, it actually finalizes the original ClientRequestState. So we // have to explicitly finalize 'retry_request_state', otherwise we'll hit some // illegal states in destroying it. The illegal states include wait_thread_ not reset, coordinator still executing, etc. http://gerrit.cloudera.org:8080/#/c/16911/10/be/src/runtime/query-driver.cc File be/src/runtime/query-driver.cc: http://gerrit.cloudera.org:8080/#/c/16911/10/be/src/runtime/query-driver.cc@85 PS10, Line 85: if (client_request_state_->query_id() != query_id) return nullptr; It's not clear to me how we will go into this state. If we encounter errors in RetryQueryFromThread, it seems no one will get the new query id. Is it possible to add test coverage for this? http://gerrit.cloudera.org:8080/#/c/16911/10/be/src/runtime/query-driver.cc@270 PS10, Line 270: if (!status.ok()) { Could you also add test coverage for this case? http://gerrit.cloudera.org:8080/#/c/16911/10/be/src/runtime/query-driver.cc@284 PS10, Line 284: if (!status.ok()) { Could you also add test coverage for this case? http://gerrit.cloudera.org:8080/#/c/16911/10/be/src/runtime/query-driver.cc@292 PS10, Line 292: CHECK_QUERY_DRIVER_DELAY I think we should mention "retry" to avoid confusion. What about "RETRY_DELAY_CHECKING_ORIGINAL_DRIVER"? http://gerrit.cloudera.org:8080/#/c/16911/10/be/src/runtime/query-driver.cc@300 PS10, Line 300: Encountered error checking the original query nit: "Failed to retry query since the original query was unregistered" http://gerrit.cloudera.org:8080/#/c/16911/10/tests/custom_cluster/test_query_retries.py File tests/custom_cluster/test_query_retries.py: http://gerrit.cloudera.org:8080/#/c/16911/10/tests/custom_cluster/test_query_retries.py@721 PS10, Line 721: Could you add codes to validate the original query is not marked as retried? We can get the original query id by handle.get_handle().id, and then get the original query profile by this id. '__validate_runtime_profiles' has some example codes. http://gerrit.cloudera.org:8080/#/c/16911/10/tests/custom_cluster/test_query_retries.py@722 PS10, Line 722: # Close the query. Do we really need to close a cancelled query? http://gerrit.cloudera.org:8080/#/c/16911/10/tests/custom_cluster/test_query_retries.py@723 PS10, Line 723: self.client.close_query(handle) We should check whether coordinator still lives. The cancel and close operations return immediately, but the crash may happen laster. So adding the following codes help to actuallly capture the bug: time.sleep(2) assert self.cluster.impalads[0].get_pid() is not None, "Coordinator crashed" Without these, the test still pass even the coordinator crashes. -- To view, visit http://gerrit.cloudera.org:8080/16911 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I4fd7228acd0a70d33859029052239f9b9f795e5d Gerrit-Change-Number: 16911 Gerrit-PatchSet: 10 Gerrit-Owner: Xianqing He <hexianqing...@126.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Quanlong Huang <huangquanl...@gmail.com> Gerrit-Reviewer: Wenzhe Zhou <wz...@cloudera.com> Gerrit-Reviewer: Xianqing He <hexianqing...@126.com> Gerrit-Comment-Date: Mon, 11 Jan 2021 09:59:29 +0000 Gerrit-HasComments: Yes