Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/14507 )
Change subject: [metrics] Hide metric live_row_count when tablet not support ...................................................................... Patch Set 3: (9 comments) http://gerrit.cloudera.org:8080/#/c/14507/1//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/14507/1//COMMIT_MSG@9 PS1, Line 9: When upgrade an tserver from version older than 1.11, and add some : partitions for an existing table, then the table will hold some tablets : that support live row counting together with some ones that not support > The live row counting is base on the newly created tablet, but not for the Can't we figure that out at runtime using the aggregated input from all tablets? Why do we need persistent state for it? If even one tablet doesn't support live row counting, we disable the feature for the table. The moment the last holdout is dropped, we can safely aggregate the live row count and report it to clients. http://gerrit.cloudera.org:8080/#/c/14507/1/src/kudu/master/table_metrics.cc File src/kudu/master/table_metrics.cc: http://gerrit.cloudera.org:8080/#/c/14507/1/src/kudu/master/table_metrics.cc@28 PS1, Line 28: METRIC_DEFINE_gauge_uint64(table, assured_live_row_count, "Assured Table Live Row count", Why rename this? It hasn't shipped yet, so we can safely change its semantics while retaining the existing name. http://gerrit.cloudera.org:8080/#/c/14507/1/src/kudu/master/table_metrics.cc@31 PS1, Line 31: "Partial tablets of the table may not support live row counting, we only sum up the " : "tablets which support."); I think this is a bad idea because clients won't know when this happens and will just see a very odd live row count. I'd rather aggregation be black or white: if all tablets support live row counting, we'll aggregate it and provide it as part of a table's metrics. Otherwise, we don't aggregate or publish it. http://gerrit.cloudera.org:8080/#/c/14507/1/src/kudu/tablet/metadata.proto File src/kudu/tablet/metadata.proto: http://gerrit.cloudera.org:8080/#/c/14507/1/src/kudu/tablet/metadata.proto@199 PS1, Line 199: required uint64 on_disk_size = 1; Technically both of these should be optional; we should always use optional for new fields. http://gerrit.cloudera.org:8080/#/c/14507/1/src/kudu/tablet/tablet.cc File src/kudu/tablet/tablet.cc: http://gerrit.cloudera.org:8080/#/c/14507/1/src/kudu/tablet/tablet.cc@1889 PS1, Line 1889: if (!metadata_->supports_live_row_count()) { : return Status::NotSupported("This tablet doesn't support live row counting"); : } Should probably be a DCHECK now; it's expected that callers ensure SupportsLiveRowCount is true first, right? http://gerrit.cloudera.org:8080/#/c/14507/1/src/kudu/tablet/tablet_replica.cc File src/kudu/tablet/tablet_replica.cc: http://gerrit.cloudera.org:8080/#/c/14507/1/src/kudu/tablet/tablet_replica.cc@824 PS1, Line 824: ignore_result(tablet->CountLiveRows(&ret)); So we can get a 0 in one of three ways: 1. The live row count is actually 0. 2. CountLiveRows returned a RuntimeError. 3. tablet was null. Ultimately we only want to report the live row count in the first case, right? Seems like we need to restructure the code further to do that: - Tablet needs only one method to get the live row count. It returns a bad Status if it doesn't support it, or if the tablet was shut down. So basically, the status quo before your change. - TabletReplica also needs just one method to get the live row count. However, that method should return a bad Status if Tablet::CountLiveRows returned a bad Status, and another bad Status if the tablet is null. - UpdateTabletStats calls TabletReplica::CountLiveRows but skips set_live_row_count() if there was a bad Status. - The metrics change (to check whether we support live row counting and instantiate accordingly) should remain, but the function binding will need to be adjusted. What happens to the metric when the tablet is shut down? In theory we should make it hidden, right? Should TabletReplica::Stop InvalidateEpoch on the metric? http://gerrit.cloudera.org:8080/#/c/14507/1/src/kudu/util/metrics.h File src/kudu/util/metrics.h: http://gerrit.cloudera.org:8080/#/c/14507/1/src/kudu/util/metrics.h@775 PS1, Line 775: // When any of the metrics is invalid for MergeFrom, the result will be invalidate and Too long, need to wrap. http://gerrit.cloudera.org:8080/#/c/14507/1/src/kudu/util/metrics.h@794 PS1, Line 794: So as far as I can tell this is true when: 1. A metric is no longer hidden, or was never instantiated as hidden, and 2. A metric has been touched at least once, or is of the kind that can't tell whether it's been touched (e.g. a Gauge). The first part makes sense, but why the second part? We only filter on "touchedness" when dumping the diagnostics log, not in the general case of dumping metrics in the web UI. If we removed the second part, then the semantics of InvalidateIfNeededInMerge make more sense: if one of the two metrics to be merged is still hidden, the other must be made hidden too. http://gerrit.cloudera.org:8080/#/c/14507/1/src/kudu/util/metrics.cc File src/kudu/util/metrics.cc: http://gerrit.cloudera.org:8080/#/c/14507/1/src/kudu/util/metrics.cc@775 PS1, Line 775: UpdateModificationEpoch(); Oops, I guess we should have had this here from day one? -- To view, visit http://gerrit.cloudera.org:8080/14507 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2a54704d8cbd64a521e65aa3e95bf1a68f7757b7 Gerrit-Change-Number: 14507 Gerrit-PatchSet: 3 Gerrit-Owner: Yingchun Lai <405403...@qq.com> Gerrit-Reviewer: Adar Dembo <a...@cloudera.com> Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Yifan Zhang <chinazhangyi...@163.com> Gerrit-Reviewer: Yingchun Lai <405403...@qq.com> Gerrit-Reviewer: helifu <hzhel...@corp.netease.com> Gerrit-Comment-Date: Fri, 18 Oct 2019 18:32:16 +0000 Gerrit-HasComments: Yes