Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/21723 )
Change subject: Add Prometheus HTTP service discovery ...................................................................... Patch Set 10: Code-Review+1 (3 comments) http://gerrit.cloudera.org:8080/#/c/21723/10/src/kudu/master/master-test.cc File src/kudu/master/master-test.cc: http://gerrit.cloudera.org:8080/#/c/21723/10/src/kudu/master/master-test.cc@4111 PS10, Line 4111: } // namespace master Since this changelist has added functionality to enable Prometheus service discovery working in multi-master Kudu cluster, does it make sense to add a test scenario to verify that it works as expected, at least for the very basic, smoke-style criteria? http://gerrit.cloudera.org:8080/#/c/21723/10/src/kudu/master/master_path_handlers.cc File src/kudu/master/master_path_handlers.cc: http://gerrit.cloudera.org:8080/#/c/21723/10/src/kudu/master/master_path_handlers.cc@984 PS10, Line 984: WriteEmptyPrometheusSDResponse(output); nit: since the server responds with HttpStatusCode::ServiceUnavailable, sending out an empty JSON object isn't exactly necessary, but it shouldn't hurt, I guess. Just curious: did you find that's a requirement to have a valid JSON object in the response sent to Prometheus even with HTTP 503? http://gerrit.cloudera.org:8080/#/c/21723/10/src/kudu/master/master_path_handlers.cc@988 PS10, Line 988: if (!l.leader_status().ok()) { : WriteEmptyPrometheusSDResponse(output); If we are sending back and empty list with HTTP 200 when a particular instance of catalog manager isn't a leader, how to avoid situations when Prometheus is getting an empty list from all the masters while they are in the process of electing a new leader? Such interval is usually very short (in the order of 1s with default Raft heartbeat interval of 500ms), but it's present. With that, we have a risk of emptying the list of Prometheus metric sources for a Kudu cluster just out of the blue for the duration of SD discovery refresh interval at least. Does it look like an issue, and if yes, maybe responding with HTTP 503 or using the 'redirect-to-the-leader-master' technique is a safer bet in this context? What do you think? -- To view, visit http://gerrit.cloudera.org:8080/21723 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I931aa72a7567c0dde43d7b7ed53a2dd0fa8bc1fe Gerrit-Change-Number: 21723 Gerrit-PatchSet: 10 Gerrit-Owner: Marton Greber <[email protected]> Gerrit-Reviewer: Alexey Serbin <[email protected]> Gerrit-Reviewer: Attila Bukor <[email protected]> Gerrit-Reviewer: Gabriella Lotz <[email protected]> Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Marton Greber <[email protected]> Gerrit-Reviewer: Wang Xixu <[email protected]> Gerrit-Reviewer: Zoltan Chovan <[email protected]> Gerrit-Reviewer: Zoltan Martonka <[email protected]> Gerrit-Comment-Date: Tue, 05 Aug 2025 16:43:57 +0000 Gerrit-HasComments: Yes
