Adar Dembo has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14507 )

Change subject: [metrics] Hide metric live_row_count when tablet not support
......................................................................


Patch Set 3:

(9 comments)

http://gerrit.cloudera.org:8080/#/c/14507/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/14507/1//COMMIT_MSG@9
PS1, Line 9: When upgrade an tserver from version older than 1.11, and add some
           : partitions for an existing table, then the table will hold some 
tablets
           : that support live row counting together with some ones that not 
support
> The live row counting is base on the newly created tablet, but not for the
Can't we figure that out at runtime using the aggregated input from all 
tablets? Why do we need persistent state for it? If even one tablet doesn't 
support live row counting, we disable the feature for the table. The moment the 
last holdout is dropped, we can safely aggregate the live row count and report 
it to clients.


http://gerrit.cloudera.org:8080/#/c/14507/1/src/kudu/master/table_metrics.cc
File src/kudu/master/table_metrics.cc:

http://gerrit.cloudera.org:8080/#/c/14507/1/src/kudu/master/table_metrics.cc@28
PS1, Line 28: METRIC_DEFINE_gauge_uint64(table, assured_live_row_count, 
"Assured Table Live Row count",
Why rename this? It hasn't shipped yet, so we can safely change its semantics 
while retaining the existing name.


http://gerrit.cloudera.org:8080/#/c/14507/1/src/kudu/master/table_metrics.cc@31
PS1, Line 31:     "Partial tablets of the table may not support live row 
counting, we only sum up the "
            :     "tablets which support.");
I think this is a bad idea because clients won't know when this happens and 
will just see a very odd live row count. I'd rather aggregation be black or 
white: if all tablets support live row counting, we'll aggregate it and provide 
it as part of a table's metrics. Otherwise, we don't aggregate or publish it.


http://gerrit.cloudera.org:8080/#/c/14507/1/src/kudu/tablet/metadata.proto
File src/kudu/tablet/metadata.proto:

http://gerrit.cloudera.org:8080/#/c/14507/1/src/kudu/tablet/metadata.proto@199
PS1, Line 199:   required uint64 on_disk_size = 1;
Technically both of these should be optional; we should always use optional for 
new fields.


http://gerrit.cloudera.org:8080/#/c/14507/1/src/kudu/tablet/tablet.cc
File src/kudu/tablet/tablet.cc:

http://gerrit.cloudera.org:8080/#/c/14507/1/src/kudu/tablet/tablet.cc@1889
PS1, Line 1889:   if (!metadata_->supports_live_row_count()) {
              :     return Status::NotSupported("This tablet doesn't support 
live row counting");
              :   }
Should probably be a DCHECK now; it's expected that callers ensure 
SupportsLiveRowCount is true first, right?


http://gerrit.cloudera.org:8080/#/c/14507/1/src/kudu/tablet/tablet_replica.cc
File src/kudu/tablet/tablet_replica.cc:

http://gerrit.cloudera.org:8080/#/c/14507/1/src/kudu/tablet/tablet_replica.cc@824
PS1, Line 824:     ignore_result(tablet->CountLiveRows(&ret));
So we can get a 0 in one of three ways:
1. The live row count is actually 0.
2. CountLiveRows returned a RuntimeError.
3. tablet was null.

Ultimately we only want to report the live row count in the first case, right? 
Seems like we need to restructure the code further to do that:

- Tablet needs only one method to get the live row count. It returns a bad 
Status if it doesn't support it, or if the tablet was shut down. So basically, 
the status quo before your change.
- TabletReplica also needs just one method to get the live row count. However, 
that method should return a bad Status if Tablet::CountLiveRows returned a bad 
Status, and another bad Status if the tablet is null.
- UpdateTabletStats calls TabletReplica::CountLiveRows but skips 
set_live_row_count() if there was a bad Status.
- The metrics change (to check whether we support live row counting and 
instantiate accordingly) should remain, but the function binding will need to 
be adjusted.

What happens to the metric when the tablet is shut down? In theory we should 
make it hidden, right? Should TabletReplica::Stop InvalidateEpoch on the metric?


http://gerrit.cloudera.org:8080/#/c/14507/1/src/kudu/util/metrics.h
File src/kudu/util/metrics.h:

http://gerrit.cloudera.org:8080/#/c/14507/1/src/kudu/util/metrics.h@775
PS1, Line 775:   // When any of the metrics is invalid for MergeFrom, the 
result will be invalidate and
Too long, need to wrap.


http://gerrit.cloudera.org:8080/#/c/14507/1/src/kudu/util/metrics.h@794
PS1, Line 794:
So as far as I can tell this is true when:
1. A metric is no longer hidden, or was never instantiated as hidden, and
2. A metric has been touched at least once, or is of the kind that can't tell 
whether it's been touched (e.g. a Gauge).

The first part makes sense, but why the second part? We only filter on 
"touchedness" when dumping the diagnostics log, not in the general case of 
dumping metrics in the web UI.

If we removed the second part, then the semantics of InvalidateIfNeededInMerge 
make more sense: if one of the two metrics to be merged is still hidden, the 
other must be made hidden too.


http://gerrit.cloudera.org:8080/#/c/14507/1/src/kudu/util/metrics.cc
File src/kudu/util/metrics.cc:

http://gerrit.cloudera.org:8080/#/c/14507/1/src/kudu/util/metrics.cc@775
PS1, Line 775:   UpdateModificationEpoch();
Oops, I guess we should have had this here from day one?



--
To view, visit http://gerrit.cloudera.org:8080/14507
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2a54704d8cbd64a521e65aa3e95bf1a68f7757b7
Gerrit-Change-Number: 14507
Gerrit-PatchSet: 3
Gerrit-Owner: Yingchun Lai <405403...@qq.com>
Gerrit-Reviewer: Adar Dembo <a...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yifan Zhang <chinazhangyi...@163.com>
Gerrit-Reviewer: Yingchun Lai <405403...@qq.com>
Gerrit-Reviewer: helifu <hzhel...@corp.netease.com>
Gerrit-Comment-Date: Fri, 18 Oct 2019 18:32:16 +0000
Gerrit-HasComments: Yes

Reply via email to