Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/14677 )
Change subject: IMPALA-9137, IMPALA-9138: Mark failed RPCs as retryable and update blacklist ...................................................................... Patch Set 4: (7 comments) http://gerrit.cloudera.org:8080/#/c/14677/3/be/src/common/status.h File be/src/common/status.h: http://gerrit.cloudera.org:8080/#/c/14677/3/be/src/common/status.h@254 PS3, Line 254: /// Mark the error as recoverable. Clients can recover from recoverable errors and a > Might want to add some documentation about the difference between a 'recove Done http://gerrit.cloudera.org:8080/#/c/14677/3/be/src/common/status.h@271 PS3, Line 271: && msg_->error() == TErrorCode::DISK_IO_ERROR; > Might be nice to have this and SetRPCErrorMsg return '*this' to allow for p Done http://gerrit.cloudera.org:8080/#/c/14677/4/be/src/common/status.cc File be/src/common/status.cc: http://gerrit.cloudera.org:8080/#/c/14677/4/be/src/common/status.cc@315 PS4, Line 315: msg_->SetRPCErrorMessage(RPCErrorMessage(status.retry_info.rpc_error_msg.dest_node)); > line too long (93 > 90) Done http://gerrit.cloudera.org:8080/#/c/14677/4/be/src/service/client-request-state.cc File be/src/service/client-request-state.cc: http://gerrit.cloudera.org:8080/#/c/14677/4/be/src/service/client-request-state.cc@906 PS4, Line 906: "Restarting of fetch requires enabling of query result caching.")).SetIsRecoverable(); > line too long (94 > 90) Done http://gerrit.cloudera.org:8080/#/c/14677/3/common/thrift/Common.thrift File common/thrift/Common.thrift: http://gerrit.cloudera.org:8080/#/c/14677/3/common/thrift/Common.thrift@1 PS3, Line 1: // Licensed to the Apache Software Foundation (ASF) under one > Why was it necessary to add this file, rather than just including Types.thr Types.thrift contains a bunch of thrift structs that aren't relevant to Status.thrift and I didn't want to populate the file with unnecessary imports (I couldn't find a way to import a specific struct from another file). I think the re-factoring is cleaner as well. http://gerrit.cloudera.org:8080/#/c/14677/3/common/thrift/Status.thrift File common/thrift/Status.thrift: http://gerrit.cloudera.org:8080/#/c/14677/3/common/thrift/Status.thrift@33 PS3, Line 33: struct TStatusProperties { > Do you have other error types in mind for the future, i.e. why not just do Yeah, that is a good point. I was originally thinking this would be used for IMPALA-8714, but after thinking about it some more, I don't think it makes that much sense to make RETRYABLE a type. I changed this and introduced a TStatusProperties which is just a generic struct that is used to describe properties of a TStatus object. For now, I put in 'is_retryable' and 'is_recoverable' as the only two entries. http://gerrit.cloudera.org:8080/#/c/14677/3/common/thrift/Status.thrift@48 PS3, Line 48: > This feels a little too specific to this patch/not extensible. We probably Agree those use cases are important. I think the current mechanism will work well for retrying queries that would otherwise fail - e.g. due to an RPC failure, or a bad disk. The relevant retry information can simply be propagated via the Status object, which is created on failure anyway. I think one issue with a string->string map is that it is too free-form, all users of the map will need to be aware of special keywords for encoding information in the map. I think making things more strongly typed via Thrift objects would be cleaner, and make the code easier to understand. I think using the information for ReportExecStatusRequestPB will be necessary when we start looking into "limping nodes" - e.g. detecting nodes that are slower than the others, and then blacklisting them. I think we would need to use ReportExecStatusRequestPB in combination with the changes to TStatus. I think it is okay to have the information come from multiple places: e.g. from exec status reports and from Status objects. They represent two distinct things anyway: blacklisting due to due actual query errors vs. blacklisting nodes because they are slow. I like the idea of using RetryInfo to encapsulate everything so that the TStatus struct doesn't get polluted with a bunch of optional structs. So I did that. -- To view, visit http://gerrit.cloudera.org:8080/14677 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c Gerrit-Change-Number: 14677 Gerrit-PatchSet: 4 Gerrit-Owner: Sahil Takiar <stak...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Michael Ho <k...@cloudera.com> Gerrit-Reviewer: Sahil Takiar <stak...@cloudera.com> Gerrit-Reviewer: Thomas Tauber-Marshall <tmarsh...@cloudera.com> Gerrit-Comment-Date: Tue, 19 Nov 2019 03:19:40 +0000 Gerrit-HasComments: Yes