Thomas Tauber-Marshall has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
......................................................................


Patch Set 27:

(25 comments)

Thanks for the changes. I know I've asked for a lot, but I think the patch is a 
lot better now.

I still have a lot of concerns about this feature, esp. about our story around 
observability, but I think those concerns can mostly be addressed in follow up 
work and most of my remaining comments are more minor things.

http://gerrit.cloudera.org:8080/#/c/14824/26/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

http://gerrit.cloudera.org:8080/#/c/14824/26/be/src/runtime/coordinator.cc@1079
PS26, Line 1079:
> Before this patch, if a query fails: 
> ClientRequestState::FinishExecQueryOrDmlRequest()
 > --> Coordinator::Exec() --> Coordinator::StartBackendExec() -->
 > ClientRequestState::UpdateQueryStatus() which sets
 > ClientRequestState::query_status_.

I don't think that's true - ClientRequestState::UpdateQueryStatus() is never 
called from inside Coordinator. If an Exec() rpc fails, before this patch 
Coordinator::Exec() will return an error status and ClientRequestState will 
call UpdateQueryStatus() on itself to set the error that's returned from Exec().

Now, MarkAsRetrying() will set an error on ClientRequestState but after that 
Coordinator::Exec() will still return an error which ClientRequestState will 
try to set on itself with UpdateQueryStatus() but it'll get discarded because 
an error was already set.

As far as I can tell, prior to this patch if the Coordinator needed to 
communicate an error to the ClientRequestState it always did so by returning an 
error Status, eg. if a backend reports an error usually that gets bubbled up 
when the ClientRequestState calls GetNext(), and now we've added this other 
patch where the ClientRequestState calls into the Coordinator and during that 
call the Coordinator calls back into the ClientRequestState to set the error 
even though its also about to return another Status that represents the same 
error with a slightly different text. The point is, its messy and I'm concerned 
it will be error prone.

 > A lot of the re-factoring we end up
 > doing now might just be wasted work if we end up changing half the
 > logic.

Sure, you don't have to take my suggestions, esp. if you have a good reason/a 
vision of your own for how to clean this all up.

I'm trying to help make sure this patch is well designed because I think it'll 
make the further work easier, eg. minimizing the amount of retry related logic 
that ends up in Coordinator should make it easier to make further changes to 
how retries work.

If you're really determined to keep the calls to TryQueryRetry in Coordinator 
in this patch, its fine with me.


http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/runtime/coordinator.cc@910
PS27, Line 910:   Status status = Status::OK();
I don't think declaring this here is necessary since its not used outside of 
the if/else immediately below


http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/runtime/query-driver.h
File be/src/runtime/query-driver.h:

http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/runtime/query-driver.h@58
PS27, Line 58: std::shared_ptr<QueryDriver>
I think you might want to make this a '&' to avoid unnecessary copies


http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/runtime/query-driver.h@93
PS27, Line 93: TryRetryQuery
TryQueryRetry


http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h
File be/src/runtime/query-driver.h:

http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h@136
PS24, Line 136: ///
> TryQueryRetry passes the status to
 > MarkAsRetrying which sets the given status as the query_status_ of
 > the ClientRequestState, which is what gets exposed in the runtime
 > profile.

Right, but MarkAsRetrying doesn't get called in the case where you call 
AddDetail() because TryQueryRetry returns immediately after AddDetail()


http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/runtime/query-driver.cc
File be/src/runtime/query-driver.cc:

http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/runtime/query-driver.cc@228
PS27, Line 228:   // Run the new query.
I think its fine to leave for now, but just wanted to note that the duplication 
of the query execution logic between here and ImpalaServer is unfortunate and 
it might be nice to have more of a QueryDriver::RunQuery() type function that 
works for both initial attempts and retries when we're doing refactoring.


http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/client-request-state.h
File be/src/service/client-request-state.h:

http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/client-request-state.h@214
PS27, Line 214:   Status InitExecRequest(const TQueryCtx& query_ctx);
This is no longer used anywhere


http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/client-request-state.h@222
PS27, Line 222:   std::string ExecStateToString(ExecState state) const;
static?


http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/client-request-state.h@225
PS27, Line 225:   std::string RetryStateToString(RetryState state) const;
static?


http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/client-request-state.h@228
PS27, Line 228:   std::shared_ptr<ImpalaServer::SessionState> session() const { 
return session_; }
I think this copies the shared_ptr, which adds another atomic inc/dec. Is there 
a reason we need to return a shared_ptr here?


http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/client-request-state.h@556
PS27, Line 556: Updated by
              :   /// UpdateQueryStatus
No longer true, due to MarkAsRetrying


http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/client-request-state.h@561
PS27, Line 561: initialized in InitExecRequest
No longer true. Also maybe note that the QueryDriver owns this.


http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/impala-beeswax-server.cc
File be/src/service/impala-beeswax-server.cc:

http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/impala-beeswax-server.cc@420
PS27, Line 420: Don't use
              :   // GetActiveQueryHandle because clients can request profiles 
for unregistered queries.
Not sure what this is supposed to mean


http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/impala-server.h
File be/src/service/impala-server.h:

http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/impala-server.h@670
PS27, Line 670: query_driver
query_handle


http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/impala-server.h@868
PS27, Line 868:     /// query_state = union(beeswax::QueryState, 
ClientRequestState::RetryState).
Probably requires more explanation, eg. "used for the http server..."


http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/impala-server.h@985
PS27, Line 985: 'request_state'
needs to be updated


http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/impala-server.h@1024
PS27, Line 1024: ClientRequestState
needs to be updated


http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/service/impala-server.cc
File be/src/service/impala-server.cc:

http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/service/impala-server.cc@1160
PS24, Line 1160:   // unregistration work. StartUnregister() succeeds for the 
first thread to call it to
> This was based on https://gerrit.cloudera.org/#/c/15821/10 - basically ther
Ah okay, I hadn't noticed the equivalent started_finalize thing in 
ClientRequestState that this is replacing.

It might be nice to define like a QueryDriver::Finalize that calls 
StartUnregister and ClientRequestState::Finalize to encapsulate this in 
QueryDriver better, but not a big deal.


http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/impala-server.cc
File be/src/service/impala-server.cc:

http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/impala-server.cc@943
PS27, Line 943:   (*query_handle).EmplaceQueryDriver(this);
Similar to noted below, you could clean this up by adding a static 
QueryDriver::CreateNewDriver(ImpalaServer&, QueryHandle*)


http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/impala-server.cc@1177
PS27, Line 1177: &*
I think you could just pass a const QueryHandle& here and it would be cleaner


http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/impala-server.cc@1297
PS27, Line 1297: shared_ptr<QueryDriver>
So I believe this copies the shared_ptr, which results in an extra atomic 
inc/dec. It might be cleaner to just call GetQueryDriver() without calling 
SetQueryDriver() yet, do the nullptr check, and then call SetQueryDriver() 
below.

I also think it would be nice to have something like a static 
QueryDriver::CreateHandle(), which would maybe take like an ImpalaServer& for 
calling GetQueryDriver() and an out QueryHandle* and then if QueryDriver was a 
friend class of QueryHandle you could get rid of 
SetQueryDriver/EmplaceQueryDriver/SetClientRequestState


http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/impala-server.cc@1504
PS27, Line 1504: QueryHandle handle;
I think this is unnecessary and adds another shared_ptr copy. You can just pass 
'query_handle' into GetActiveQueryHandle


http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/impala-server.cc@2411
PS27, Line 2411: query_handle
request_state?


http://gerrit.cloudera.org:8080/#/c/14824/27/tests/custom_cluster/test_query_retries.py
File tests/custom_cluster/test_query_retries.py:

http://gerrit.cloudera.org:8080/#/c/14824/27/tests/custom_cluster/test_query_retries.py@92
PS27, Line 92:   def test_kill_impalad_expect_retry(self):
I don't think I see a test case in here for if the retry is attempted due to an 
Exec() rpc failing


http://gerrit.cloudera.org:8080/#/c/14824/27/tests/custom_cluster/test_query_retries.py@363
PS27, Line 363:   def test_retry_query_hs2(self):
Fwiw, since beeswax is now deprecated and pretty much all clients use hs2, it 
would be preferable to have as more testing with hs2 vs. beeswax

A lot of these tests work out basically the same for both, so its not necessary 
to rewrite them all.



--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Gerrit-Change-Number: 14824
Gerrit-PatchSet: 27
Gerrit-Owner: Sahil Takiar <stak...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <joemcdonn...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <stak...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tmarsh...@cloudera.com>
Gerrit-Comment-Date: Tue, 12 May 2020 19:32:09 +0000
Gerrit-HasComments: Yes

Reply via email to