Sahil Takiar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14677 )

Change subject: IMPALA-9137, IMPALA-9138: Mark failed RPCs as retryable and 
update blacklist
......................................................................


Patch Set 4:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/14677/3/be/src/common/status.h
File be/src/common/status.h:

http://gerrit.cloudera.org:8080/#/c/14677/3/be/src/common/status.h@254
PS3, Line 254:   /// Mark the error as recoverable. Clients can recover from 
recoverable errors and a
> Might want to add some documentation about the difference between a 'recove
Done


http://gerrit.cloudera.org:8080/#/c/14677/3/be/src/common/status.h@271
PS3, Line 271:         && msg_->error() == TErrorCode::DISK_IO_ERROR;
> Might be nice to have this and SetRPCErrorMsg return '*this' to allow for p
Done


http://gerrit.cloudera.org:8080/#/c/14677/4/be/src/common/status.cc
File be/src/common/status.cc:

http://gerrit.cloudera.org:8080/#/c/14677/4/be/src/common/status.cc@315
PS4, Line 315:         
msg_->SetRPCErrorMessage(RPCErrorMessage(status.retry_info.rpc_error_msg.dest_node));
> line too long (93 > 90)
Done


http://gerrit.cloudera.org:8080/#/c/14677/4/be/src/service/client-request-state.cc
File be/src/service/client-request-state.cc:

http://gerrit.cloudera.org:8080/#/c/14677/4/be/src/service/client-request-state.cc@906
PS4, Line 906:         "Restarting of fetch requires enabling of query result 
caching.")).SetIsRecoverable();
> line too long (94 > 90)
Done


http://gerrit.cloudera.org:8080/#/c/14677/3/common/thrift/Common.thrift
File common/thrift/Common.thrift:

http://gerrit.cloudera.org:8080/#/c/14677/3/common/thrift/Common.thrift@1
PS3, Line 1: // Licensed to the Apache Software Foundation (ASF) under one
> Why was it necessary to add this file, rather than just including Types.thr
Types.thrift contains a bunch of thrift structs that aren't relevant to 
Status.thrift and I didn't want to populate the file with unnecessary imports 
(I couldn't find a way to import a specific struct from another file). I think 
the re-factoring is cleaner as well.


http://gerrit.cloudera.org:8080/#/c/14677/3/common/thrift/Status.thrift
File common/thrift/Status.thrift:

http://gerrit.cloudera.org:8080/#/c/14677/3/common/thrift/Status.thrift@33
PS3, Line 33: struct TStatusProperties {
> Do you have other error types in mind for the future, i.e. why not just do
Yeah, that is a good point. I was originally thinking this would be used for 
IMPALA-8714, but after thinking about it some more, I don't think it makes that 
much sense to make RETRYABLE a type.

I changed this and introduced a TStatusProperties which is just a generic 
struct that is used to describe properties of a TStatus object. For now, I put 
in 'is_retryable' and 'is_recoverable' as the only two entries.


http://gerrit.cloudera.org:8080/#/c/14677/3/common/thrift/Status.thrift@48
PS3, Line 48:
> This feels a little too specific to this patch/not extensible. We probably
Agree those use cases are important. I think the current mechanism will work 
well for retrying queries that would otherwise fail - e.g. due to an RPC 
failure, or a bad disk. The relevant retry information can simply be propagated 
via the Status object, which is created on failure anyway.

I think one issue with a string->string map is that it is too free-form, all 
users of the map will need to be aware of special keywords for encoding 
information in the map. I think making things more strongly typed via Thrift 
objects would be cleaner, and make the code easier to understand.

I think using the information for ReportExecStatusRequestPB will be necessary 
when we start looking into "limping nodes" - e.g. detecting nodes that are 
slower than the others, and then blacklisting them. I think we would need to 
use ReportExecStatusRequestPB in combination with the changes to TStatus. I 
think it is okay to have the information come from multiple places: e.g. from 
exec status reports and from Status objects. They represent two distinct things 
anyway: blacklisting due to due actual query errors vs. blacklisting nodes 
because they are slow.

I like the idea of using RetryInfo to encapsulate everything so that the 
TStatus struct doesn't get polluted with a bunch of optional structs. So I did 
that.



--
To view, visit http://gerrit.cloudera.org:8080/14677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
Gerrit-Change-Number: 14677
Gerrit-PatchSet: 4
Gerrit-Owner: Sahil Takiar <stak...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Michael Ho <k...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <stak...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tmarsh...@cloudera.com>
Gerrit-Comment-Date: Tue, 19 Nov 2019 03:19:40 +0000
Gerrit-HasComments: Yes

Reply via email to