[Impala-ASF-CR] IMPALA-8858: Add metrics tracking num queries running on executor groups

Lars Volker (Code Review) Wed, 11 Sep 2019 15:13:41 -0700

Lars Volker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14103 )


Change subject: IMPALA-8858: Add metrics tracking num queries running on 
executor groups
......................................................................


Patch Set 12:

(21 comments)

http://gerrit.cloudera.org:8080/#/c/14103/12/be/src/runtime/exec-env.cc
File be/src/runtime/exec-env.cc:

http://gerrit.cloudera.org:8080/#/c/14103/12/be/src/runtime/exec-env.cc@501
PS12, Line 501:       [server](ClusterMembershipMgr::SnapshotPtr snapshot) {
nit: I'd consider following the FE example and add a helper for this, but since 
it's less code and inside a shorter function, I'm also OK with keeping it here.


http://gerrit.cloudera.org:8080/#/c/14103/12/be/src/scheduling/admission-controller.h
File be/src/scheduling/admission-controller.h:

http://gerrit.cloudera.org:8080/#/c/14103/12/be/src/scheduling/admission-controller.h@508
PS12, Line 508: QueryAndMemory
I thought the shorter version was also good, especially since there's still 
only one parameter.


http://gerrit.cloudera.org:8080/#/c/14103/12/be/src/scheduling/admission-controller.h@988
PS12, Line 988: /// Updates the list of executor groups for which we maintain 
the query load metrics.
nit: indentation


http://gerrit.cloudera.org:8080/#/c/14103/12/be/src/scheduling/admission-controller.h@995
PS12, Line 995: Should
nit: Isn't this mandatory, i.e. shouldn't we say "Must be called"?


http://gerrit.cloudera.org:8080/#/c/14103/12/be/src/scheduling/admission-controller.cc
File be/src/scheduling/admission-controller.cc:

http://gerrit.cloudera.org:8080/#/c/14103/12/be/src/scheduling/admission-controller.cc@425
PS12, Line 425:   // Now update the pool stats.
Is there a reason why we do host stats and pool stats in different ordere here 
and below? It's all under the lock, right?


http://gerrit.cloudera.org:8080/#/c/14103/12/be/src/scheduling/admission-controller.cc@431
PS12, Line 431:   PoolStats* pool_stats = GetPoolStats(schedule);
nit: Follow the same pattern above and here (temporary for pool_stats or not, 
add to pools_for_updates_ or not?


http://gerrit.cloudera.org:8080/#/c/14103/12/be/src/scheduling/admission-controller.cc@432
PS12, Line 432:   pool_stats->AdmitQueryAndMemory(schedule);
no need to add to pools_for_updates_?


http://gerrit.cloudera.org:8080/#/c/14103/12/be/src/scheduling/admission-controller.cc@1955
PS12, Line 1955: Locked
The "Locked" part of the name confused me at some of the caller sites because 
other methods that are called together with this one don't have it in their 
name.


http://gerrit.cloudera.org:8080/#/c/14103/12/be/src/scheduling/admission-controller.cc@1958
PS12, Line 1958: (
nit: single line?


http://gerrit.cloudera.org:8080/#/c/14103/12/be/src/scheduling/cluster-membership-mgr.h
File be/src/scheduling/cluster-membership-mgr.h:

http://gerrit.cloudera.org:8080/#/c/14103/12/be/src/scheduling/cluster-membership-mgr.h@65
PS12, Line 65: receive notifications
Mention that no notifications are sent out for blacklisted nodes (see longer 
comment in cc file)


http://gerrit.cloudera.org:8080/#/c/14103/12/be/src/scheduling/cluster-membership-mgr.cc
File be/src/scheduling/cluster-membership-mgr.cc:

http://gerrit.cloudera.org:8080/#/c/14103/12/be/src/scheduling/cluster-membership-mgr.cc@390
PS12, Line 390:  // - The update sent to the impala server is used to cancel 
queries on backends that
              :   //   are no longer present in 'current_backends', but we 
don't remove the executor from
              :   //   'current_backends' here, since it always reflects the 
full statestore membership.
              :   //   This avoids cancelling queries that may still be running 
successfully, eg. if the
              :   //   backend is still up but got blacklisted due to a flaky 
network. If the backend
              :   //   really is down, we'll still cancel the queries when the 
statestore removes it from
              :   //   the membership.
              :   // - The update sent to the admission controller is used to 
maintain a metric for each
              :   //   group having at least 1 executor that keeps track of the 
queries running on it. We
              :   //   can defer that to the statestore update and avoid 
unnecessary metric deletions due
              :   //   to flaky backends being blacklisted.
              :   // - The update sent to frontend is used to notify the 
planner, but the executors the
              :   //   planner schedules things on is just a hint and the 
Scheduler will still see the
              :   //   updated membership and choose non-blacklisted executors 
regardless of what the
              :   //   planner says, so its fine to wait until the next topic 
update to notify the fe.
I think we need to mention this in the header file near the top where we 
explain the callbacks. Now that the code is more generic, this part still 
assumes that only these three callbacks are registering. If someone adds 
callbacks in the future they might expect them to be called whenever the state 
changes. In addition we should explain that callers should not hold on to the 
snapshot pointer in the callbacks (I thought we could actually do that in the 
AdmissionController and just reuse what we got in the last callback, but that 
way we would miss the blacklisting changes).


http://gerrit.cloudera.org:8080/#/c/14103/12/be/src/scheduling/cluster-membership-mgr.cc@443
PS12, Line 443: void ClusterMembershipMgr::SetState(const SnapshotPtr& 
new_state) {
nit: How would this look if we added a parameter "bool update_metrics" with a 
default to "true"? I don't feel strongly about it, feel free to ignore.


http://gerrit.cloudera.org:8080/#/c/14103/12/be/src/service/impala-server.cc
File be/src/service/impala-server.cc:

http://gerrit.cloudera.org:8080/#/c/14103/12/be/src/service/impala-server.cc@1918
PS12, Line 1918: std::unordered_set<TUniqueId>::const_iterator
nit: could use auto, might even fit into the for loop


http://gerrit.cloudera.org:8080/#/c/14103/12/be/src/util/metrics.h
File be/src/util/metrics.h:

http://gerrit.cloudera.org:8080/#/c/14103/12/be/src/util/metrics.h@337
PS12, Line 337: the pointer
nit: any pointers


http://gerrit.cloudera.org:8080/#/c/14103/12/be/src/util/metrics.h@367
PS12, Line 367: the
nit: any pointers


http://gerrit.cloudera.org:8080/#/c/14103/12/be/src/util/metrics.h@368
PS12, Line 368:  so callers should take care not to use those and to
              :   /// get rid of any references to it to avoid dangling pointer 
issues
I think this part is not needed, it's implied by the first half of the sentence


http://gerrit.cloudera.org:8080/#/c/14103/12/common/thrift/metrics.json
File common/thrift/metrics.json:

http://gerrit.cloudera.org:8080/#/c/14103/12/common/thrift/metrics.json@2505
PS12, Line 2505: seperated
separated, I think the description is also stale.


http://gerrit.cloudera.org:8080/#/c/14103/12/common/thrift/metrics.json@2509
PS12, Line 2509: Total number of executor groups
This label doesn't match the description (number vs list)


http://gerrit.cloudera.org:8080/#/c/14103/12/tests/custom_cluster/test_executor_groups.py
File tests/custom_cluster/test_executor_groups.py:

http://gerrit.cloudera.org:8080/#/c/14103/12/tests/custom_cluster/test_executor_groups.py@335
PS12, Line 335: Test
nit: Tests


http://gerrit.cloudera.org:8080/#/c/14103/12/tests/custom_cluster/test_executor_groups.py@384
PS12, Line 384:       "cluster-membership.executor-groups.total-healthy", 1, 
timeout=30)
also check here that executor-groups.total is still 2?


http://gerrit.cloudera.org:8080/#/c/14103/12/tests/custom_cluster/test_executor_groups.py@411
PS12, Line 411:       "cluster-membership.executor-groups.total-healthy", 0, 
timeout=30)
Also add checks for executor-groups.total here?



--
To view, visit http://gerrit.cloudera.org:8080/14103
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I58cde8699c33af8b87273437e9d8bf6371a34539
Gerrit-Change-Number: 14103
Gerrit-PatchSet: 12
Gerrit-Owner: Bikramjeet Vig <bikramjeet....@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <asher...@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bikramjeet....@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Lars Volker <l...@cloudera.com>
Gerrit-Comment-Date: Wed, 11 Sep 2019 22:12:57 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8858: Add metrics tracking num queries running on executor groups

Reply via email to