Thomas Tauber-Marshall has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14677 )

Change subject: IMPALA-9137, IMPALA-9138: Mark failed RPCs as retryable and 
update blacklist
......................................................................


Patch Set 3:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/14677/3/be/src/common/status.h
File be/src/common/status.h:

http://gerrit.cloudera.org:8080/#/c/14677/3/be/src/common/status.h@254
PS3, Line 254:   bool IsRecoverableError() const {
Might want to add some documentation about the difference between a 
'recoverable' error and a 'retryable' error.


http://gerrit.cloudera.org:8080/#/c/14677/3/be/src/common/status.h@271
PS3, Line 271:   void SetIsRetryable() {
Might be nice to have this and SetRPCErrorMsg return '*this' to allow for 
patterns like:

return Status("error").SetIsRetryable();


http://gerrit.cloudera.org:8080/#/c/14677/3/common/thrift/Common.thrift
File common/thrift/Common.thrift:

http://gerrit.cloudera.org:8080/#/c/14677/3/common/thrift/Common.thrift@1
PS3, Line 1: // Licensed to the Apache Software Foundation (ASF) under one
Why was it necessary to add this file, rather than just including Types.thrift 
in Status.thrift?


http://gerrit.cloudera.org:8080/#/c/14677/3/common/thrift/Status.thrift
File common/thrift/Status.thrift:

http://gerrit.cloudera.org:8080/#/c/14677/3/common/thrift/Status.thrift@33
PS3, Line 33: enum TErrorType {
Do you have other error types in mind for the future, i.e. why not just do a 
'bool retryable' in TStatus instead?


http://gerrit.cloudera.org:8080/#/c/14677/3/common/thrift/Status.thrift@48
PS3, Line 48: 4: optional TRPCErrorMessage rpc_msg
This feels a little too specific to this patch/not extensible. We probably 
don't want to add a bunch of extra fields to TStatus for various other 
retriable failure scenarios. Some other scenarios we might need to address in 
the future:
- Ask the coordinator to blacklist a node for a reason other than an rpc 
failure, eg. because it has a bad disk
- Ask the coordinator to perform an action other than blacklisting a node 
before retrying the query, eg. if the failure is something where the query 
needs to be re-planned and with or without reloading metadata

Some ideas for how to address these issues:
- Make this something more generic, like a string->string map, then for this 
patch have a "BLACKLIST"-><hostname> entry in the map
- Instead of putting this stuff in the TStatus, put it somewhere like in the 
ReportExecStatusRequestPB, eg. add a list<TNetworkAddress> nodes_to_blacklist. 
That's kind of nice because its not clear to me that it makes sense to think of 
this stuff as a property of the status anyways. The downside is that these 
error statuses can come from lots of different places and it may be hard to 
bubble up the relevant info if its not returned in the status, but maybe we 
could add functions for it to RuntimeState or FragmentInstanceState or 
whatever. That could also make this more flexible - you can have a query that 
succeeds but still tells the coordinator to blacklist a node because it was 
detected to be slow or something.
- Maybe just rename it to RetryInfo or something, then if it ends up with 5-10 
fields for different scenarios at least its all clearly contained together

We can also of course just punt on this decision until we do the follow up work 
that would actually need it, eg. maybe we won't ever use this sort of mechanism 
for blacklisting nodes for stuff like bad disks because that will just be 
covered by the extended health check mechanism. Just stuff to think about



--
To view, visit http://gerrit.cloudera.org:8080/14677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
Gerrit-Change-Number: 14677
Gerrit-PatchSet: 3
Gerrit-Owner: Sahil Takiar <stak...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Michael Ho <k...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tmarsh...@cloudera.com>
Gerrit-Comment-Date: Fri, 15 Nov 2019 23:41:27 +0000
Gerrit-HasComments: Yes

Reply via email to