[Impala-ASF-CR] IMPALA-5746: Cancel all queries scheduled by failed coordinators

Wenzhe Zhou (Code Review) Tue, 21 Jul 2020 13:32:15 -0700

Wenzhe Zhou has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16215 )


Change subject: IMPALA-5746: Cancel all queries scheduled by failed coordinators
......................................................................


Patch Set 3:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/16215/3/be/src/runtime/exec-env.cc
File be/src/runtime/exec-env.cc:

http://gerrit.cloudera.org:8080/#/c/16215/3/be/src/runtime/exec-env.cc@554
PS3, Line 554:           
server->CancelQueriesOnFailedBackends(current_backend_set);
> Yeah for this and your other comment below, probably the cleanest thing is
I was thinking to reuse the backend set and save a loop with one callback 
function to handle two tasks. But it seems not clean. Will change to register 
two separate callback functions as suggested by Thomas.


http://gerrit.cloudera.org:8080/#/c/16215/3/be/src/runtime/exec-env.cc@560
PS3, Line 560:     cluster_membership_mgr_->RegisterUpdateCallbackFn(
             :         [server](ClusterMembershipMgr::SnapshotPtr snapshot) {
             :           std::unordered_set<BackendIdPB> current_backend_set;
             :           for (const auto& it : snapshot->current_backends) {
             :             current_backend_set.insert(it.second.backend_id());
             :           }
             :           
server->CancelQueriesOnFailedBackends(current_backend_set);
             :         });
> not necessarily related to your change, but it would be nice to make this c
Will fix it.


http://gerrit.cloudera.org:8080/#/c/16215/3/be/src/runtime/query-exec-mgr.cc
File be/src/runtime/query-exec-mgr.cc:

http://gerrit.cloudera.org:8080/#/c/16215/3/be/src/runtime/query-exec-mgr.cc@222
PS3, Line 222:   // TODO: create cancellation task queue and working thread to 
run cancellation tasks
             :   // on a separate thread. If the queue is full, ignore the 
cancellations since we'll
             :   // be able to process them on the next heartbeat instead.
             :
             :   for (auto& qs : to_cancel) {
             :     VLOG(1) << "CancelQueriesForFailedCoordinators(): cancel 
query " << qs->query_id();
             :     qs->Cancel();
             :     qs->is_coord_active_.Store(false);
             :     ReleaseQueryState(qs);
             :   }
> I had suggested not to put this in ImpalaServer because its primarily the c
Will define a new thread pool owned by QueryExecMgr.


http://gerrit.cloudera.org:8080/#/c/16215/3/be/src/runtime/query-exec-mgr.cc@222
PS3, Line 222:   // TODO: create cancellation task queue and working thread to 
run cancellation tasks
             :   // on a separate thread. If the queue is full, ignore the 
cancellations since we'll
             :   // be able to process them on the next heartbeat instead.
             :
             :   for (auto& qs : to_cancel) {
             :     VLOG(1) << "CancelQueriesForFailedCoordinators(): cancel 
query " << qs->query_id();
             :     qs->Cancel();
             :     qs->is_coord_active_.Store(false);
             :     ReleaseQueryState(qs);
             :   }
> if you move this into impala-server.cc you can use the ImpalaServer::Cancel
The item type and work_function which runs for each item are set when creating 
a thread pool. It seems that one thread pool cannot serve two different types 
of tasks. The class CancellationWork is defined to handle cancellation of 
remote running fragments for coordinator. Maybe we should define a separate 
cancellation work for executer to cancel local running fragments and create a 
separate thread pool with one thread.


http://gerrit.cloudera.org:8080/#/c/16215/3/common/thrift/ImpalaInternalService.thrift
File common/thrift/ImpalaInternalService.thrift:

http://gerrit.cloudera.org:8080/#/c/16215/3/common/thrift/ImpalaInternalService.thrift@607
PS3, Line 607:   27: optional Types.TUniqueId coord_backend_id
Will change coord_backend_id data type as UniqueIdPB and move it to 
FragmentInstanceExecStatusPB in control_service.proto.


http://gerrit.cloudera.org:8080/#/c/16215/3/tests/custom_cluster/test_restart_services.py
File tests/custom_cluster/test_restart_services.py:

http://gerrit.cloudera.org:8080/#/c/16215/3/tests/custom_cluster/test_restart_services.py@241
PS3, Line 241:   def test_kill_coordinator(self):
> This is an okay place for this test (obviously we have a lot of different t
Yes, it's better to move it to test_process_failures.py. Will do.



--
To view, visit http://gerrit.cloudera.org:8080/16215
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I918fcc27649d5d2bbe8b6ef47fbd9810ae5f57bd
Gerrit-Change-Number: 16215
Gerrit-PatchSet: 3
Gerrit-Owner: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <stak...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tmarsh...@cloudera.com>
Gerrit-Reviewer: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Comment-Date: Tue, 21 Jul 2020 20:30:51 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5746: Cancel all queries scheduled by failed coordinators

Reply via email to