Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/14677 )
Change subject: IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails ...................................................................... Patch Set 10: (11 comments) Addressed all comments except for the one on the commit message. That will require a few more changes I am still working through. http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/runtime/coordinator.h File be/src/runtime/coordinator.h: http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/runtime/coordinator.h@252 PS10, Line 252: /// TNetworkAddress. > mention ownership, nullptr (see above) Done http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/runtime/coordinator.cc File be/src/runtime/coordinator.cc: http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/runtime/coordinator.cc@209 PS10, Line 209: if (UNLIKELY(!addr_to_backend_state_.insert(addr_to_backend_state).second)) { > you can use emplace() to construct the element in place. Done http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/runtime/coordinator.cc@815 PS10, Line 815: for (auto status_iter = request.instance_exec_status().begin(); > Can you use a range-based loop? Done http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/runtime/coordinator.cc@826 PS10, Line 826: addr_to_backend_state_.at(NetworkAddressPBToTNetworkAddress(dest_node)); > This will throw an exception if the element does not exist (which is almost Done http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/runtime/coordinator.cc@829 PS10, Line 829: blacklisted_node = true; > Could just break; the loop here. I can't see where blacklisted_node is used Done http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/runtime/runtime-state.h File be/src/runtime/runtime-state.h: http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/runtime/runtime-state.h@322 PS10, Line 322: AuxErrorInfoPB* aux_error_info() { > Another option to improve encapsulation would be to do what we do with the Done. Followed Thomas' suggestion. Changed to GetAuxErrorInfo(AuxErrorInfoPB). http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/runtime/runtime-state.h@444 PS10, Line 444: /// query_status_ != Status::OK(). > Mention who owns this and who cleans it up? We should not rely on the rest Done http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/runtime/runtime-state.cc File be/src/runtime/runtime-state.cc: http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/runtime/runtime-state.cc@317 PS10, Line 317: NetworkAddressPB* network_addr = new NetworkAddressPB(); > Yeah, strong preference for 'mutable_*' over the other options. Done http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/util/network-util.h File be/src/util/network-util.h: http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/util/network-util.h@72 PS10, Line 72: TNetworkAddress NetworkAddressPBToTNetworkAddress(const NetworkAddressPB& address); > I think FromNetworkAddressPB() would be descriptive enough since the callin Done http://gerrit.cloudera.org:8080/#/c/14677/10/common/protobuf/control_service.proto File common/protobuf/control_service.proto: http://gerrit.cloudera.org:8080/#/c/14677/10/common/protobuf/control_service.proto@137 PS10, Line 137: RPC to another node failed > I think here it would be good to mention that this is between two executors Technically, RPCErrorInfoPB could contain the address of the Coordinator. IIUC KRPC is used between the Coordinator fragment <-- Executor fragment RPCs. I updated the code in coordinator.cc so that it doesn't blacklist itself even if the RPCErrorInfoPB::dest_node is set to the Coordinator address. I updated the commit message to document this as well. http://gerrit.cloudera.org:8080/#/c/14677/10/tests/custom_cluster/test_blacklist.py File tests/custom_cluster/test_blacklist.py: http://gerrit.cloudera.org:8080/#/c/14677/10/tests/custom_cluster/test_blacklist.py@121 PS10, Line 121: """Verifies that an RPC failure causes the target node to be blacklisted. The > Can you add a test that checks that a failure between the two non-coordinat Changed this test to use a single dedicated coordinator. -- To view, visit http://gerrit.cloudera.org:8080/14677 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c Gerrit-Change-Number: 14677 Gerrit-PatchSet: 10 Gerrit-Owner: Sahil Takiar <stak...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Lars Volker <l...@cloudera.com> Gerrit-Reviewer: Michael Ho <michael...@gmail.com> Gerrit-Reviewer: Sahil Takiar <stak...@cloudera.com> Gerrit-Reviewer: Thomas Tauber-Marshall <tmarsh...@cloudera.com> Gerrit-Comment-Date: Thu, 12 Dec 2019 03:02:03 +0000 Gerrit-HasComments: Yes