Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/14677 )
Change subject: IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails ...................................................................... Patch Set 12: (4 comments) Updated patch. Now Impala only blacklists a target node of a RPC if that node failed with a specific posix error code: 107, 108, or 111. I decided on these values purely based on what happens to a running query when an Impalad executor crashes in the middle of execution. We can consider adding more to the list later. I had to re-write the test though. It looks like the debug action in IMPALA-8138 causes KRPC calls to fail with a "RemoteError", but in practice KRPC will actually fail with a "NetworkError" (the error types seem to be a Kudu Status feature). If you actually kill an Impala executor while a query is running, KRPC throws a "NetworkError" since the TCP connection probably got killed. I made blacklisting specific to "NetworkError"s. I think this makes more sense since KRPC actually returns a "NetworkError" when the connection dies, and you probably don't want to blacklist a node due to a "RemoteError" (at least I don't know of a failure scenario where you would want to). http://gerrit.cloudera.org:8080/#/c/14677/11/be/src/runtime/coordinator.cc File be/src/runtime/coordinator.cc: http://gerrit.cloudera.org:8080/#/c/14677/11/be/src/runtime/coordinator.cc@826 PS11, Line 826: backend_resource_state_->MarkBackendFinished(backend_state, &releasable_backends); > line too long (92 > 90) Done http://gerrit.cloudera.org:8080/#/c/14677/11/be/src/runtime/coordinator.cc@831 PS11, Line 831: backend_released_barrier_.Notify(); > line too long (98 > 90) Done http://gerrit.cloudera.org:8080/#/c/14677/11/be/src/runtime/krpc-data-stream-sender.cc File be/src/runtime/krpc-data-stream-sender.cc: http://gerrit.cloudera.org:8080/#/c/14677/11/be/src/runtime/krpc-data-stream-sender.cc@406 PS11, Line 406: return rpc_status_; note: this turned out to be redundant so removed it http://gerrit.cloudera.org:8080/#/c/14677/10/common/protobuf/control_service.proto File common/protobuf/control_service.proto: http://gerrit.cloudera.org:8080/#/c/14677/10/common/protobuf/control_service.proto@137 PS10, Line 137: led fragment instance. Use > Fwiw ClusterMembershipMgr::BlacklistExecutor already takes care of making s Sounds good. -- To view, visit http://gerrit.cloudera.org:8080/14677 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c Gerrit-Change-Number: 14677 Gerrit-PatchSet: 12 Gerrit-Owner: Sahil Takiar <stak...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Lars Volker <l...@cloudera.com> Gerrit-Reviewer: Michael Ho <michael...@gmail.com> Gerrit-Reviewer: Sahil Takiar <stak...@cloudera.com> Gerrit-Reviewer: Thomas Tauber-Marshall <tmarsh...@cloudera.com> Gerrit-Comment-Date: Thu, 12 Dec 2019 19:58:28 +0000 Gerrit-HasComments: Yes