Sahil Takiar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14677 )

Change subject: IMPALA-9137: Blacklist node if a DataStreamService RPC to the 
node fails
......................................................................


Patch Set 12:

(4 comments)

Updated patch. Now Impala only blacklists a target node of a RPC if that node 
failed with a specific posix error code: 107, 108, or 111. I decided on these 
values purely based on what happens to a running query when an Impalad executor 
crashes in the middle of execution. We can consider adding more to the list 
later.

I had to re-write the test though. It looks like the debug action in 
IMPALA-8138 causes KRPC calls to fail with a "RemoteError", but in practice 
KRPC will actually fail with a "NetworkError" (the error types seem to be a 
Kudu Status feature). If you actually kill an Impala executor while a query is 
running, KRPC throws a "NetworkError" since the TCP connection probably got 
killed. I made blacklisting specific to "NetworkError"s. I think this makes 
more sense since KRPC actually returns a "NetworkError" when the connection 
dies, and you probably don't want to blacklist a node due to a "RemoteError" 
(at least I don't know of a failure scenario where you would want to).

http://gerrit.cloudera.org:8080/#/c/14677/11/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

http://gerrit.cloudera.org:8080/#/c/14677/11/be/src/runtime/coordinator.cc@826
PS11, Line 826:     backend_resource_state_->MarkBackendFinished(backend_state, 
&releasable_backends);
> line too long (92 > 90)
Done


http://gerrit.cloudera.org:8080/#/c/14677/11/be/src/runtime/coordinator.cc@831
PS11, Line 831:         backend_released_barrier_.Notify();
> line too long (98 > 90)
Done


http://gerrit.cloudera.org:8080/#/c/14677/11/be/src/runtime/krpc-data-stream-sender.cc
File be/src/runtime/krpc-data-stream-sender.cc:

http://gerrit.cloudera.org:8080/#/c/14677/11/be/src/runtime/krpc-data-stream-sender.cc@406
PS11, Line 406:     return rpc_status_;
note: this turned out to be redundant so removed it


http://gerrit.cloudera.org:8080/#/c/14677/10/common/protobuf/control_service.proto
File common/protobuf/control_service.proto:

http://gerrit.cloudera.org:8080/#/c/14677/10/common/protobuf/control_service.proto@137
PS10, Line 137: led fragment instance. Use
> Fwiw ClusterMembershipMgr::BlacklistExecutor already takes care of making s
Sounds good.



--
To view, visit http://gerrit.cloudera.org:8080/14677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
Gerrit-Change-Number: 14677
Gerrit-PatchSet: 12
Gerrit-Owner: Sahil Takiar <stak...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Lars Volker <l...@cloudera.com>
Gerrit-Reviewer: Michael Ho <michael...@gmail.com>
Gerrit-Reviewer: Sahil Takiar <stak...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tmarsh...@cloudera.com>
Gerrit-Comment-Date: Thu, 12 Dec 2019 19:58:28 +0000
Gerrit-HasComments: Yes

Reply via email to