Sahil Takiar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14677 )

Change subject: IMPALA-9137: Blacklist node if a DataStreamService RPC to the 
node fails
......................................................................


Patch Set 10:

(11 comments)

Addressed all comments except for the one on the commit message. That will 
require a few more changes I am still working through.

http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/runtime/coordinator.h
File be/src/runtime/coordinator.h:

http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/runtime/coordinator.h@252
PS10, Line 252:   /// TNetworkAddress.
> mention ownership, nullptr (see above)
Done


http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/runtime/coordinator.cc@209
PS10, Line 209:     if 
(UNLIKELY(!addr_to_backend_state_.insert(addr_to_backend_state).second)) {
> you can use emplace() to construct the element in place.
Done


http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/runtime/coordinator.cc@815
PS10, Line 815:       for (auto status_iter = 
request.instance_exec_status().begin();
> Can you use a range-based loop?
Done


http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/runtime/coordinator.cc@826
PS10, Line 826:               
addr_to_backend_state_.at(NetworkAddressPBToTNetworkAddress(dest_node));
> This will throw an exception if the element does not exist (which is almost
Done


http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/runtime/coordinator.cc@829
PS10, Line 829:           blacklisted_node = true;
> Could just break; the loop here. I can't see where blacklisted_node is used
Done


http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/runtime/runtime-state.h
File be/src/runtime/runtime-state.h:

http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/runtime/runtime-state.h@322
PS10, Line 322:   AuxErrorInfoPB* aux_error_info() {
> Another option to improve encapsulation would be to do what we do with the
Done.

Followed Thomas' suggestion. Changed to GetAuxErrorInfo(AuxErrorInfoPB).


http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/runtime/runtime-state.h@444
PS10, Line 444:   /// query_status_ != Status::OK().
> Mention who owns this and who cleans it up? We should not rely on the rest
Done


http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/runtime/runtime-state.cc
File be/src/runtime/runtime-state.cc:

http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/runtime/runtime-state.cc@317
PS10, Line 317:   NetworkAddressPB* network_addr = new NetworkAddressPB();
> Yeah, strong preference for 'mutable_*' over the other options.
Done


http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/util/network-util.h
File be/src/util/network-util.h:

http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/util/network-util.h@72
PS10, Line 72: TNetworkAddress NetworkAddressPBToTNetworkAddress(const 
NetworkAddressPB& address);
> I think FromNetworkAddressPB() would be descriptive enough since the callin
Done


http://gerrit.cloudera.org:8080/#/c/14677/10/common/protobuf/control_service.proto
File common/protobuf/control_service.proto:

http://gerrit.cloudera.org:8080/#/c/14677/10/common/protobuf/control_service.proto@137
PS10, Line 137: RPC to another node failed
> I think here it would be good to mention that this is between two executors
Technically, RPCErrorInfoPB could contain the address of the Coordinator. IIUC 
KRPC is used between the Coordinator fragment <-- Executor fragment RPCs.

I updated the code in coordinator.cc so that it doesn't blacklist itself even 
if the RPCErrorInfoPB::dest_node is set to the Coordinator address. I updated 
the commit message to document this as well.


http://gerrit.cloudera.org:8080/#/c/14677/10/tests/custom_cluster/test_blacklist.py
File tests/custom_cluster/test_blacklist.py:

http://gerrit.cloudera.org:8080/#/c/14677/10/tests/custom_cluster/test_blacklist.py@121
PS10, Line 121:     """Verifies that an RPC failure causes the target node to 
be blacklisted. The
> Can you add a test that checks that a failure between the two non-coordinat
Changed this test to use a single dedicated coordinator.



--
To view, visit http://gerrit.cloudera.org:8080/14677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
Gerrit-Change-Number: 14677
Gerrit-PatchSet: 10
Gerrit-Owner: Sahil Takiar <stak...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Lars Volker <l...@cloudera.com>
Gerrit-Reviewer: Michael Ho <michael...@gmail.com>
Gerrit-Reviewer: Sahil Takiar <stak...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tmarsh...@cloudera.com>
Gerrit-Comment-Date: Thu, 12 Dec 2019 03:02:03 +0000
Gerrit-HasComments: Yes

Reply via email to