Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/21316 )
Change subject: [metrics] Add metrics for create and delete op time ...................................................................... Patch Set 1: (6 comments) http://gerrit.cloudera.org:8080/#/c/21316/1//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/21316/1//COMMIT_MSG@11 PS1, Line 11: These monitoring metrics will be very helpful for analyzing : issues related to high CPU usage. Just curious: did you see that creating/deleting a tablet consume a lot of CPU for some reason? If you are interested in tracking CPU usage, I'd think that more appropriate measurement in this context would be real, system, and user time consumed, not just difference in wall clock time that's captured by the newly introduced metrics, no? http://gerrit.cloudera.org:8080/#/c/21316/1/src/kudu/tserver/tablet_server-test.cc File src/kudu/tserver/tablet_server-test.cc: http://gerrit.cloudera.org:8080/#/c/21316/1/src/kudu/tserver/tablet_server-test.cc@4057 PS1, Line 4057: CHECK_EQ Here and below: use ASSERT in tests, not CHECK http://gerrit.cloudera.org:8080/#/c/21316/1/src/kudu/tserver/tablet_server-test.cc@4132 PS1, Line 4132: LOG(INFO) << "metric delete_tablet_run_time MinValue: " : << delete_tablet_run_time->histogram()->MinValue() : << ", MeanValue: " << delete_tablet_run_time->histogram()->MeanValue() : << ", MaxValue: " << delete_tablet_run_time->histogram()->MaxValue() : << ", TotalCount: " << delete_tablet_run_time->histogram()->TotalCount() : << ", TotalSum: " << delete_tablet_run_time->histogram()->TotalSum(); Why not to use HdrHistogram::DumpHumanReadable() here instead? http://gerrit.cloudera.org:8080/#/c/21316/1/src/kudu/tserver/ts_tablet_manager-test.cc File src/kudu/tserver/ts_tablet_manager-test.cc: http://gerrit.cloudera.org:8080/#/c/21316/1/src/kudu/tserver/ts_tablet_manager-test.cc@151 PS1, Line 151: CHECK_EQ(before_cnt + 1, create_tablet_run_time->TotalCount()); Why to assert this invariant right here? If you want to check the functionality of new newly introduced metric, maybe add a new test scenario to test it and have proper assertion there, but not in this common method? http://gerrit.cloudera.org:8080/#/c/21316/1/src/kudu/tserver/ts_tablet_manager-test.cc@152 PS1, Line 152: LOG(INFO) << "metric create_tablet_run_time MinValue: " : << create_tablet_run_time->histogram()->MinValue() : << ", MeanValue: " << create_tablet_run_time->histogram()->MeanValue() : << ", MaxValue: " << create_tablet_run_time->histogram()->MaxValue() : << ", TotalCount: " << create_tablet_run_time->histogram()->TotalCount() : << ", TotalSum: " << create_tablet_run_time->histogram()->TotalSum(); Is it possible to use HdrHistogram::DumpHumanReadable() here instead? BTW, why to add this? Our current tests are quite verbose already, so I'd think twice before adding extra output which isn't used to distinguish between passing and failing tests programmatically or useful for troubleshooting/debugging of failing tests by humans. http://gerrit.cloudera.org:8080/#/c/21316/1/src/kudu/tserver/ts_tablet_manager.cc File src/kudu/tserver/ts_tablet_manager.cc: http://gerrit.cloudera.org:8080/#/c/21316/1/src/kudu/tserver/ts_tablet_manager.cc@674 PS1, Line 674: MonoTime time_started = MonoTime::Now(); If you include time taken to open a newly created tablet into the whole measurement, then why not to include the time taken to acquire tablet's manager lock as well? In other words, why not to sample 'time_started' at line 654? -- To view, visit http://gerrit.cloudera.org:8080/21316 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I02bd52013caa94a33143cb16ff3831a49b74bac4 Gerrit-Change-Number: 21316 Gerrit-PatchSet: 1 Gerrit-Owner: KeDeng <kdeng...@gmail.com> Gerrit-Reviewer: Alexey Serbin <ale...@apache.org> Gerrit-Comment-Date: Wed, 17 Apr 2024 07:21:31 +0000 Gerrit-HasComments: Yes