Thomas Tauber-Marshall has posted comments on this change. ( http://gerrit.cloudera.org:8080/14677 )
Change subject: IMPALA-9137, IMPALA-9138: Mark failed RPCs as retryable and update blacklist ...................................................................... Patch Set 3: (5 comments) http://gerrit.cloudera.org:8080/#/c/14677/3/be/src/common/status.h File be/src/common/status.h: http://gerrit.cloudera.org:8080/#/c/14677/3/be/src/common/status.h@254 PS3, Line 254: bool IsRecoverableError() const { Might want to add some documentation about the difference between a 'recoverable' error and a 'retryable' error. http://gerrit.cloudera.org:8080/#/c/14677/3/be/src/common/status.h@271 PS3, Line 271: void SetIsRetryable() { Might be nice to have this and SetRPCErrorMsg return '*this' to allow for patterns like: return Status("error").SetIsRetryable(); http://gerrit.cloudera.org:8080/#/c/14677/3/common/thrift/Common.thrift File common/thrift/Common.thrift: http://gerrit.cloudera.org:8080/#/c/14677/3/common/thrift/Common.thrift@1 PS3, Line 1: // Licensed to the Apache Software Foundation (ASF) under one Why was it necessary to add this file, rather than just including Types.thrift in Status.thrift? http://gerrit.cloudera.org:8080/#/c/14677/3/common/thrift/Status.thrift File common/thrift/Status.thrift: http://gerrit.cloudera.org:8080/#/c/14677/3/common/thrift/Status.thrift@33 PS3, Line 33: enum TErrorType { Do you have other error types in mind for the future, i.e. why not just do a 'bool retryable' in TStatus instead? http://gerrit.cloudera.org:8080/#/c/14677/3/common/thrift/Status.thrift@48 PS3, Line 48: 4: optional TRPCErrorMessage rpc_msg This feels a little too specific to this patch/not extensible. We probably don't want to add a bunch of extra fields to TStatus for various other retriable failure scenarios. Some other scenarios we might need to address in the future: - Ask the coordinator to blacklist a node for a reason other than an rpc failure, eg. because it has a bad disk - Ask the coordinator to perform an action other than blacklisting a node before retrying the query, eg. if the failure is something where the query needs to be re-planned and with or without reloading metadata Some ideas for how to address these issues: - Make this something more generic, like a string->string map, then for this patch have a "BLACKLIST"-><hostname> entry in the map - Instead of putting this stuff in the TStatus, put it somewhere like in the ReportExecStatusRequestPB, eg. add a list<TNetworkAddress> nodes_to_blacklist. That's kind of nice because its not clear to me that it makes sense to think of this stuff as a property of the status anyways. The downside is that these error statuses can come from lots of different places and it may be hard to bubble up the relevant info if its not returned in the status, but maybe we could add functions for it to RuntimeState or FragmentInstanceState or whatever. That could also make this more flexible - you can have a query that succeeds but still tells the coordinator to blacklist a node because it was detected to be slow or something. - Maybe just rename it to RetryInfo or something, then if it ends up with 5-10 fields for different scenarios at least its all clearly contained together We can also of course just punt on this decision until we do the follow up work that would actually need it, eg. maybe we won't ever use this sort of mechanism for blacklisting nodes for stuff like bad disks because that will just be covered by the extended health check mechanism. Just stuff to think about -- To view, visit http://gerrit.cloudera.org:8080/14677 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c Gerrit-Change-Number: 14677 Gerrit-PatchSet: 3 Gerrit-Owner: Sahil Takiar <stak...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Michael Ho <k...@cloudera.com> Gerrit-Reviewer: Thomas Tauber-Marshall <tmarsh...@cloudera.com> Gerrit-Comment-Date: Fri, 15 Nov 2019 23:41:27 +0000 Gerrit-HasComments: Yes