Thomas Tauber-Marshall has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 )
Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes ...................................................................... Patch Set 27: (25 comments) Thanks for the changes. I know I've asked for a lot, but I think the patch is a lot better now. I still have a lot of concerns about this feature, esp. about our story around observability, but I think those concerns can mostly be addressed in follow up work and most of my remaining comments are more minor things. http://gerrit.cloudera.org:8080/#/c/14824/26/be/src/runtime/coordinator.cc File be/src/runtime/coordinator.cc: http://gerrit.cloudera.org:8080/#/c/14824/26/be/src/runtime/coordinator.cc@1079 PS26, Line 1079: > Before this patch, if a query fails: > ClientRequestState::FinishExecQueryOrDmlRequest() > --> Coordinator::Exec() --> Coordinator::StartBackendExec() --> > ClientRequestState::UpdateQueryStatus() which sets > ClientRequestState::query_status_. I don't think that's true - ClientRequestState::UpdateQueryStatus() is never called from inside Coordinator. If an Exec() rpc fails, before this patch Coordinator::Exec() will return an error status and ClientRequestState will call UpdateQueryStatus() on itself to set the error that's returned from Exec(). Now, MarkAsRetrying() will set an error on ClientRequestState but after that Coordinator::Exec() will still return an error which ClientRequestState will try to set on itself with UpdateQueryStatus() but it'll get discarded because an error was already set. As far as I can tell, prior to this patch if the Coordinator needed to communicate an error to the ClientRequestState it always did so by returning an error Status, eg. if a backend reports an error usually that gets bubbled up when the ClientRequestState calls GetNext(), and now we've added this other patch where the ClientRequestState calls into the Coordinator and during that call the Coordinator calls back into the ClientRequestState to set the error even though its also about to return another Status that represents the same error with a slightly different text. The point is, its messy and I'm concerned it will be error prone. > A lot of the re-factoring we end up > doing now might just be wasted work if we end up changing half the > logic. Sure, you don't have to take my suggestions, esp. if you have a good reason/a vision of your own for how to clean this all up. I'm trying to help make sure this patch is well designed because I think it'll make the further work easier, eg. minimizing the amount of retry related logic that ends up in Coordinator should make it easier to make further changes to how retries work. If you're really determined to keep the calls to TryQueryRetry in Coordinator in this patch, its fine with me. http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/runtime/coordinator.cc File be/src/runtime/coordinator.cc: http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/runtime/coordinator.cc@910 PS27, Line 910: Status status = Status::OK(); I don't think declaring this here is necessary since its not used outside of the if/else immediately below http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/runtime/query-driver.h File be/src/runtime/query-driver.h: http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/runtime/query-driver.h@58 PS27, Line 58: std::shared_ptr<QueryDriver> I think you might want to make this a '&' to avoid unnecessary copies http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/runtime/query-driver.h@93 PS27, Line 93: TryRetryQuery TryQueryRetry http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h File be/src/runtime/query-driver.h: http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h@136 PS24, Line 136: /// > TryQueryRetry passes the status to > MarkAsRetrying which sets the given status as the query_status_ of > the ClientRequestState, which is what gets exposed in the runtime > profile. Right, but MarkAsRetrying doesn't get called in the case where you call AddDetail() because TryQueryRetry returns immediately after AddDetail() http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/runtime/query-driver.cc File be/src/runtime/query-driver.cc: http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/runtime/query-driver.cc@228 PS27, Line 228: // Run the new query. I think its fine to leave for now, but just wanted to note that the duplication of the query execution logic between here and ImpalaServer is unfortunate and it might be nice to have more of a QueryDriver::RunQuery() type function that works for both initial attempts and retries when we're doing refactoring. http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/client-request-state.h File be/src/service/client-request-state.h: http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/client-request-state.h@214 PS27, Line 214: Status InitExecRequest(const TQueryCtx& query_ctx); This is no longer used anywhere http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/client-request-state.h@222 PS27, Line 222: std::string ExecStateToString(ExecState state) const; static? http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/client-request-state.h@225 PS27, Line 225: std::string RetryStateToString(RetryState state) const; static? http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/client-request-state.h@228 PS27, Line 228: std::shared_ptr<ImpalaServer::SessionState> session() const { return session_; } I think this copies the shared_ptr, which adds another atomic inc/dec. Is there a reason we need to return a shared_ptr here? http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/client-request-state.h@556 PS27, Line 556: Updated by : /// UpdateQueryStatus No longer true, due to MarkAsRetrying http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/client-request-state.h@561 PS27, Line 561: initialized in InitExecRequest No longer true. Also maybe note that the QueryDriver owns this. http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/impala-beeswax-server.cc File be/src/service/impala-beeswax-server.cc: http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/impala-beeswax-server.cc@420 PS27, Line 420: Don't use : // GetActiveQueryHandle because clients can request profiles for unregistered queries. Not sure what this is supposed to mean http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/impala-server.h File be/src/service/impala-server.h: http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/impala-server.h@670 PS27, Line 670: query_driver query_handle http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/impala-server.h@868 PS27, Line 868: /// query_state = union(beeswax::QueryState, ClientRequestState::RetryState). Probably requires more explanation, eg. "used for the http server..." http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/impala-server.h@985 PS27, Line 985: 'request_state' needs to be updated http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/impala-server.h@1024 PS27, Line 1024: ClientRequestState needs to be updated http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/service/impala-server.cc File be/src/service/impala-server.cc: http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/service/impala-server.cc@1160 PS24, Line 1160: // unregistration work. StartUnregister() succeeds for the first thread to call it to > This was based on https://gerrit.cloudera.org/#/c/15821/10 - basically ther Ah okay, I hadn't noticed the equivalent started_finalize thing in ClientRequestState that this is replacing. It might be nice to define like a QueryDriver::Finalize that calls StartUnregister and ClientRequestState::Finalize to encapsulate this in QueryDriver better, but not a big deal. http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/impala-server.cc File be/src/service/impala-server.cc: http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/impala-server.cc@943 PS27, Line 943: (*query_handle).EmplaceQueryDriver(this); Similar to noted below, you could clean this up by adding a static QueryDriver::CreateNewDriver(ImpalaServer&, QueryHandle*) http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/impala-server.cc@1177 PS27, Line 1177: &* I think you could just pass a const QueryHandle& here and it would be cleaner http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/impala-server.cc@1297 PS27, Line 1297: shared_ptr<QueryDriver> So I believe this copies the shared_ptr, which results in an extra atomic inc/dec. It might be cleaner to just call GetQueryDriver() without calling SetQueryDriver() yet, do the nullptr check, and then call SetQueryDriver() below. I also think it would be nice to have something like a static QueryDriver::CreateHandle(), which would maybe take like an ImpalaServer& for calling GetQueryDriver() and an out QueryHandle* and then if QueryDriver was a friend class of QueryHandle you could get rid of SetQueryDriver/EmplaceQueryDriver/SetClientRequestState http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/impala-server.cc@1504 PS27, Line 1504: QueryHandle handle; I think this is unnecessary and adds another shared_ptr copy. You can just pass 'query_handle' into GetActiveQueryHandle http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/impala-server.cc@2411 PS27, Line 2411: query_handle request_state? http://gerrit.cloudera.org:8080/#/c/14824/27/tests/custom_cluster/test_query_retries.py File tests/custom_cluster/test_query_retries.py: http://gerrit.cloudera.org:8080/#/c/14824/27/tests/custom_cluster/test_query_retries.py@92 PS27, Line 92: def test_kill_impalad_expect_retry(self): I don't think I see a test case in here for if the retry is attempted due to an Exec() rpc failing http://gerrit.cloudera.org:8080/#/c/14824/27/tests/custom_cluster/test_query_retries.py@363 PS27, Line 363: def test_retry_query_hs2(self): Fwiw, since beeswax is now deprecated and pretty much all clients use hs2, it would be preferable to have as more testing with hs2 vs. beeswax A lot of these tests work out basically the same for both, so its not necessary to rewrite them all. -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd Gerrit-Change-Number: 14824 Gerrit-PatchSet: 27 Gerrit-Owner: Sahil Takiar <stak...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Joe McDonnell <joemcdonn...@cloudera.com> Gerrit-Reviewer: Sahil Takiar <stak...@cloudera.com> Gerrit-Reviewer: Thomas Tauber-Marshall <tmarsh...@cloudera.com> Gerrit-Comment-Date: Tue, 12 May 2020 19:32:09 +0000 Gerrit-HasComments: Yes