Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/13426 )
Change subject: KUDU-2797: the master aggregates tablet metrics ...................................................................... Patch Set 6: (10 comments) http://gerrit.cloudera.org:8080/#/c/13426/6//COMMIT_MSG Commit Message: PS6: Could you update this to use "stats" too? Could you also update this to describe the behavior of collecting stats from all replicas, but only displaying one. Do you plan on using the follower stats at all? http://gerrit.cloudera.org:8080/#/c/13426/6//COMMIT_MSG@15 PS6, Line 15: And on this basis, we can expose this information to Impala Question about how you intend on using this: This patch currently displays this in the master web UI, which is useful to operators. But these also only get updated every 5 minutes by default. In your experience, is this sufficient for Impala queries? Should we also expose an API to compute and aggregate these at runtime so Impala can get the most up-to-date stats? http://gerrit.cloudera.org:8080/#/c/13426/6/src/kudu/master/catalog_manager.cc File src/kudu/master/catalog_manager.cc: http://gerrit.cloudera.org:8080/#/c/13426/6/src/kudu/master/catalog_manager.cc@5257 PS6, Line 5257: const auto it = replica_stats_.find(replica_id); : if (it != replica_stats_.end()) { : return it->second; : } nit: This may be simpler if you use FindOrNull, e.g. { <lock> const auto* stats = FindOrNull(replica_stats_, replica_id); if (stats) { return *stats; } } return ReportedTabletStatsPB(); http://gerrit.cloudera.org:8080/#/c/13426/6/src/kudu/tserver/heartbeater.h File src/kudu/tserver/heartbeater.h: PS6: nit: now that this hasn't changed, could you revert this file? http://gerrit.cloudera.org:8080/#/c/13426/6/src/kudu/tserver/heartbeater.cc File src/kudu/tserver/heartbeater.cc: http://gerrit.cloudera.org:8080/#/c/13426/6/src/kudu/tserver/heartbeater.cc@559 PS6, Line 559: // Update the tablet statistics if necessary. nit: maybe consider putting this by L442. DoHeartbeat seems to contain the logic for putting the heartbeat request together (and sending them), whereas RunThread focuses more on scheduling the heartbeats and handling the RPC response. Also it seems like there is at least some other code in DoHeartbeat that updates server state. http://gerrit.cloudera.org:8080/#/c/13426/6/src/kudu/tserver/ts_tablet_manager.cc File src/kudu/tserver/ts_tablet_manager.cc: http://gerrit.cloudera.org:8080/#/c/13426/6/src/kudu/tserver/ts_tablet_manager.cc@1289 PS6, Line 1289: StatsMap::const_iterator it = stats_map_.find(replica->tablet_id()); : if (it != stats_map_.end()) { : reported_tablet->mutable_stats()->CopyFrom(*it->second); : } FindOrNull might be convenient here too. Also I think we should be able to do something like: *reported_tablet->mutable_stats() = *tablet_stats; instead of CopyFrom(). http://gerrit.cloudera.org:8080/#/c/13426/6/src/kudu/tserver/ts_tablet_manager.cc@1545 PS6, Line 1545: // Check if the time is up. : { : shared_lock<rw_spinlock> l(lock_stats_); : if (MonoTime::Now() < next_update_time_) { : return; : } : } : : // Only one thread is allowed to update at the same time. : if (!lock_stats_.try_lock()) { : return; : } Rather than locking twice, what do you think about something like: if (!lock_stats_.try_lock()) { return; } if (MonoTime::Now() < next_update_time_) { return; } ? http://gerrit.cloudera.org:8080/#/c/13426/6/src/kudu/tserver/ts_tablet_manager.cc@1568 PS6, Line 1568: unique_ptr<ReportedTabletStatsPB> pb(new ReportedTabletStatsPB()) Do these need to be heap-allocated? Could we create the ReportedTableStatsPB on stack and std::move() them into the map? http://gerrit.cloudera.org:8080/#/c/13426/6/src/kudu/tserver/ts_tablet_manager.cc@1600 PS6, Line 1600: for (const auto& tablet_id : dirty_tablets) { : MarkTabletDirty(tablet_id, "The tablet statistics have been changed"); : } > I think it's necessary to add a new function to support batch dirty tablets Hrm, indeed. Alternatively, perhaps these could use: server_->heartbeater()->MarkTabletDirty(tablet_id, reason); to not trigger ASAP? Would that be sufficient? http://gerrit.cloudera.org:8080/#/c/13426/6/www/table.mustache File www/table.mustache: http://gerrit.cloudera.org:8080/#/c/13426/6/www/table.mustache@35 PS6, Line 35: <tr><td>Disk Size:</td><td>{{table_disk_size}}</td></tr> Since this is also a UI change, if you can, could you manually test on a small cluster that these look correct in case we have: - a live tablet - a tombstoned tablet - a deleted tablet ? -- To view, visit http://gerrit.cloudera.org:8080/13426 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I74406ab7cca7c22fda455c328b8ee9989a6b2d99 Gerrit-Change-Number: 13426 Gerrit-PatchSet: 6 Gerrit-Owner: helifu <hzhel...@corp.netease.com> Gerrit-Reviewer: Adar Dembo <a...@cloudera.com> Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com> Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Mike Percy <mpe...@apache.org> Gerrit-Reviewer: helifu <hzhel...@corp.netease.com> Gerrit-Comment-Date: Wed, 12 Jun 2019 01:17:47 +0000 Gerrit-HasComments: Yes