Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/21356 )
Change subject: [metrics] Add metrics for tablet copy op time
......................................................................


Patch Set 8:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/21356/8/src/kudu/tablet/tablet_metrics.cc
File src/kudu/tablet/tablet_metrics.cc:

http://gerrit.cloudera.org:8080/#/c/21356/8/src/kudu/tablet/tablet_metrics.cc@227
PS8, Line 227: Tablet Copy Operation Duration
Would 'Tablet Copy Duration' be good enough?


http://gerrit.cloudera.org:8080/#/c/21356/8/src/kudu/tablet/tablet_metrics.cc@229
PS8, Line 229: on this tablet
Does it make sense to mention that this is the duration as seen from the source tablet replica?


http://gerrit.cloudera.org:8080/#/c/21356/8/src/kudu/tablet/tablet_metrics.cc@229
PS8, Line 229: copy tablet
tablet copying


http://gerrit.cloudera.org:8080/#/c/21356/8/src/kudu/tablet/tablet_metrics.cc@231
PS8, Line 231: 60000000LU
Yingchun already pointed this out in PS5, but it seems there is still room for improvement.

With the current settings, 60000000 stands for a maximum value of 1 minute (60 seconds). Are you sure this makes sense? I suspect that for a large tablet replica being copied over a slow network it might take tens of minutes to complete the operation, so an hour for the maximum duration would be prudent.

Also, I don't think it makes a lot of sense to use microseconds as the unit for the duration here. I'd think milliseconds is the finest granularity needed.

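To make the unit arithmetic in the comment above concrete, here is a minimal sketch using std::chrono. The constant and helper names (kPs8MaxValueUs, kOneHourMs, ExceedsMax) are hypothetical and are not taken from the Kudu source; the sketch only checks the numbers, it does not touch the metrics code.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical names for illustration only; not part of the Kudu code base.

// The upper bound used in PS8, interpreted in microseconds.
constexpr int64_t kPs8MaxValueUs = 60000000;

// 60000000 microseconds is exactly one minute.
static_assert(
    kPs8MaxValueUs == std::chrono::microseconds(std::chrono::minutes(1)).count(),
    "PS8 bound equals one minute in microseconds");

// A one-hour upper bound expressed in milliseconds.
constexpr int64_t kOneHourMs =
    std::chrono::milliseconds(std::chrono::hours(1)).count();
static_assert(kOneHourMs == 3600000, "one hour is 3,600,000 ms");

// Returns true if duration `d` is larger than a bound given in microseconds.
constexpr bool ExceedsMax(std::chrono::microseconds d, int64_t max_us) {
  return d.count() > max_us;
}

// A copy that takes 30 minutes would not fit under the PS8 bound.
static_assert(ExceedsMax(std::chrono::minutes(30), kPs8MaxValueUs),
              "a 30-minute copy exceeds the one-minute bound");
```

Under these assumptions, a one-hour bound in milliseconds is 3600000, which is also a smaller number than the current one-minute bound in microseconds.
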
http://gerrit.cloudera.org:8080/#/c/21356/8/src/kudu/tserver/tablet_copy_service-test.cc
File src/kudu/tserver/tablet_copy_service-test.cc:

http://gerrit.cloudera.org:8080/#/c/21356/8/src/kudu/tserver/tablet_copy_service-test.cc@237
PS8, Line 237: TEST_F(TabletCopyMetricTest, TestRunTimeMetricFinishState) {
             :   const auto before_cnt = CopyRunTime()->TotalCount();
             :   string session_id;
             :   ASSERT_OK(DoBeginValidTabletCopySession(&session_id));
             : 
             :   EndTabletCopySessionResponsePB resp;
             :   RpcController controller;
             :   ASSERT_OK(DoEndTabletCopySession(session_id, true, nullptr, &resp, &controller));
             :   ASSERT_EQ(before_cnt + 1, CopyRunTime()->TotalCount());
             : }
Does it make sense to verify the corresponding metric at the destination tablet replica as well? I guess it should show 0 for TotalCount(), right?


http://gerrit.cloudera.org:8080/#/c/21356/8/src/kudu/tserver/tablet_copy_source_session.cc
File src/kudu/tserver/tablet_copy_source_session.cc:

http://gerrit.cloudera.org:8080/#/c/21356/8/src/kudu/tserver/tablet_copy_source_session.cc@514
PS8, Line 514:   int64_t elapsed_ms = (MonoTime::Now() - start_time_).ToMilliseconds();
             :   metrics->tablet_copy_duration->Increment(elapsed_ms);
In tablet_metrics.cc, tablet_copy_duration is defined in microseconds in PS8, so this code, which records milliseconds, is inconsistent with it.



-- 
To view, visit http://gerrit.cloudera.org:8080/21356
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I088f6a9a8a07ad39ca95ae8b4995ce00d1a0d00c
Gerrit-Change-Number: 21356
Gerrit-PatchSet: 8
Gerrit-Owner: KeDeng <kdeng...@gmail.com>
Gerrit-Reviewer: Alexey Serbin <ale...@apache.org>
Gerrit-Reviewer: KeDeng <kdeng...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yingchun Lai <laiyingc...@apache.org>
Gerrit-Comment-Date: Fri, 17 May 2024 18:14:55 +0000
Gerrit-HasComments: Yes