Grant Henke has posted comments on this change. ( http://gerrit.cloudera.org:8080/14630 )
Change subject: [metrics] Add metric severity level ...................................................................... Patch Set 7: (16 comments) http://gerrit.cloudera.org:8080/#/c/14630/6//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/14630/6//COMMIT_MSG@13 PS6, Line 13: Additionally, as the tablet count grows within the cluster, the tablet level : metrics may overwhelm users and some metric collection tools. For example : a server with 37 > I'd rephrase this a bit: "Additionally, as the tablet count grows within th Done http://gerrit.cloudera.org:8080/#/c/14630/6//COMMIT_MSG@30 PS6, Line 30: > over time Done http://gerrit.cloudera.org:8080/#/c/14630/6/docs/administration.adoc File docs/administration.adoc: http://gerrit.cloudera.org:8080/#/c/14630/6/docs/administration.adoc@111 PS6, Line 111: - `/metrics?level=info` - limits the returned metrics based on their severity level. > Should probably doc the default level here too. Done http://gerrit.cloudera.org:8080/#/c/14630/6/src/kudu/consensus/log_metrics.cc File src/kudu/consensus/log_metrics.cc: http://gerrit.cloudera.org:8080/#/c/14630/6/src/kudu/consensus/log_metrics.cc@30 PS6, Line 30: kudu::MetricLevel::kInfo, > Likewise. Done http://gerrit.cloudera.org:8080/#/c/14630/6/src/kudu/consensus/log_metrics.cc@36 PS6, Line 36: kudu::MetricLevel::kInfo, > This one is more useful, kInfo? Done http://gerrit.cloudera.org:8080/#/c/14630/6/src/kudu/consensus/log_reader.cc File src/kudu/consensus/log_reader.cc: http://gerrit.cloudera.org:8080/#/c/14630/6/src/kudu/consensus/log_reader.cc@55 PS6, Line 55: kudu::MetricLevel::kInfo, > kInfo Done http://gerrit.cloudera.org:8080/#/c/14630/6/src/kudu/master/sentry_client_metrics.cc File src/kudu/master/sentry_client_metrics.cc: http://gerrit.cloudera.org:8080/#/c/14630/6/src/kudu/master/sentry_client_metrics.cc@46 PS6, Line 46: kudu::MetricLevel::kInfo, > kInfo on this one too. Done http://gerrit.cloudera.org:8080/#/c/14630/6/src/kudu/tablet/tablet_metrics.cc File src/kudu/tablet/tablet_metrics.cc: http://gerrit.cloudera.org:8080/#/c/14630/6/src/kudu/tablet/tablet_metrics.cc@30 PS6, Line 30: METRIC_DEFINE_counter(tablet, rows_inserted, "Rows Inserted", : kudu::MetricUnit::kRows, : "Number of rows inserted into this tablet since service start", : kudu::MetricLevel::kDebug); : METRIC_DEFINE_counter(tablet, rows_upserted, "Rows Upserted", : kudu::MetricUnit::kRows, : "Number of rows upserted into this tablet since service start", : kudu::MetricLevel::kDebug); : METRIC_DEFINE_counter(tablet, rows_updated, "Rows Updated", : kudu::MetricUnit::kRows, : "Number of row update operations performed on this tablet since service start", : kudu::MetricLevel::kDebug); : METRIC_DEFINE_counter(tablet, rows_deleted, "Rows Deleted", : kudu::MetricUnit::kRows, : "Number of row delete operations performed on this tablet since service start", : kudu::MetricLevel::kDebug); > I think these should be kInfo. Why are these especially useful? Given they are cumulative since service start I find them harder to use and interpret. http://gerrit.cloudera.org:8080/#/c/14630/6/src/kudu/tablet/tablet_metrics.cc@247 PS6, Line 247: METRIC_DEFINE_histogram(tablet, flush_dms_duration, > These histograms are super useful; they should be kInfo. Done http://gerrit.cloudera.org:8080/#/c/14630/6/src/kudu/tablet/tablet_metrics.cc@307 PS6, Line 307: kudu::MetricLevel::kWarn); > kWarn, this could indicate too much memory pressure. Done http://gerrit.cloudera.org:8080/#/c/14630/6/src/kudu/tablet/tablet_replica.cc File src/kudu/tablet/tablet_replica.cc: http://gerrit.cloudera.org:8080/#/c/14630/6/src/kudu/tablet/tablet_replica.cc@64 PS6, Line 64: METRIC_DEFINE_histogram(tablet, op_prepare_queue_length, "Operation Prepare Queue Length", : kudu::MetricUnit::kTasks, : "Number of operations waiting to be prepared within this tablet. " : "High queue lengths indicate that the server is unable to process " : "operations as fast as they are being written to the WAL.", : kudu::MetricLevel::kInfo, : 10000, 2); : : METRIC_DEFINE_histogram(tablet, op_prepare_queue_time, "Operation Prepare Queue Time", : kudu::MetricUnit::kMicroseconds, : "Time that operations spent waiting in the prepare queue before being " : "processed. High queue times indicate that the server is unable to " : "process operations as fast as they are being written to the WAL.", : kudu::MetricLevel::kInfo, : 10000000, 2); : : METRIC_DEFINE_histogram(tablet, op_prepare_run_time, "Operation Prepare Run Time", : kudu::MetricUnit::kMicroseconds, : "Time that operations spent being prepared in the tablet. " : "High values may indicate that the server is under-provisioned or " : "that operations are experiencing high contention with one another for " : "locks.", : kudu::MetricLevel::kInfo, : 10000000, 2); > kInfo, these are often useful in diagnosing lock contention (i.e. several " Done http://gerrit.cloudera.org:8080/#/c/14630/6/src/kudu/tablet/tablet_replica_mm_ops.cc File src/kudu/tablet/tablet_replica_mm_ops.cc: http://gerrit.cloudera.org:8080/#/c/14630/6/src/kudu/tablet/tablet_replica_mm_ops.cc@84 PS6, Line 84: METRIC_DEFINE_histogram(tablet, log_gc_duration, > Generally speaking I think all histograms associated with MM ops should be Done http://gerrit.cloudera.org:8080/#/c/14630/6/src/kudu/util/metrics.h File src/kudu/util/metrics.h: http://gerrit.cloudera.org:8080/#/c/14630/6/src/kudu/util/metrics.h@420 PS6, Line 420: always wa > Nit: second use of 'generally', drop Done http://gerrit.cloudera.org:8080/#/c/14630/6/src/kudu/util/metrics.h@427 PS6, Line 427: enum class MetricLevel { > Why wrap it in a struct? You could define Name as a free function and just Done http://gerrit.cloudera.org:8080/#/c/14630/6/src/kudu/util/metrics.h@452 PS6, Line 452: : struct MergeAttributes { > Why a vector and not just a Level? Doesn't one level imply all the levels b Done http://gerrit.cloudera.org:8080/#/c/14630/6/src/kudu/util/metrics.cc File src/kudu/util/metrics.cc: http://gerrit.cloudera.org:8080/#/c/14630/6/src/kudu/util/metrics.cc@41 PS6, Line 41: : // Process/server-wide metrics should go into the 'server' entity. : // More complex applications will define other entities. : METRIC_DEFINE_entity(server); : : namespace kudu { : : using std::string; > On second thought, I don't think we want a gflag for this at all. Level-bas The idea here was that if we know certain metrics aren't commonly used and introduce significant load at scale (either through logging or URL) we could omit them by default but allow users to opt-in to seeing them. I can remove the flag for now. We can add it back if we find it useful. -- To view, visit http://gerrit.cloudera.org:8080/14630 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I5aa7d2323bb75700104c348a3ae859fc449e1715 Gerrit-Change-Number: 14630 Gerrit-PatchSet: 7 Gerrit-Owner: Grant Henke <granthe...@apache.org> Gerrit-Reviewer: Adar Dembo <a...@cloudera.com> Gerrit-Reviewer: Alexey Serbin <aser...@cloudera.com> Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com> Gerrit-Reviewer: Attila Bukor <abu...@apache.org> Gerrit-Reviewer: Grant Henke <granthe...@apache.org> Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Tidy Bot (241) Gerrit-Reviewer: Yingchun Lai <405403...@qq.com> Gerrit-Reviewer: helifu <hzhel...@corp.netease.com> Gerrit-Comment-Date: Thu, 07 Nov 2019 22:44:34 +0000 Gerrit-HasComments: Yes