Alexey Serbin has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/21723 )

Change subject: Add Prometheus HTTP service discovery
......................................................................


Patch Set 10: Code-Review+1

(3 comments)

http://gerrit.cloudera.org:8080/#/c/21723/10/src/kudu/master/master-test.cc
File src/kudu/master/master-test.cc:

http://gerrit.cloudera.org:8080/#/c/21723/10/src/kudu/master/master-test.cc@4111
PS10, Line 4111: } // namespace master
Since this changelist has added functionality to enable Prometheus service 
discovery working in multi-master Kudu cluster, does it make sense to add a 
test scenario to verify that it works as expected, at least for the very basic, 
smoke-style criteria?


http://gerrit.cloudera.org:8080/#/c/21723/10/src/kudu/master/master_path_handlers.cc
File src/kudu/master/master_path_handlers.cc:

http://gerrit.cloudera.org:8080/#/c/21723/10/src/kudu/master/master_path_handlers.cc@984
PS10, Line 984: WriteEmptyPrometheusSDResponse(output);
nit: since the server responds with HttpStatusCode::ServiceUnavailable, sending 
out an empty JSON object isn't exactly necessary, but it shouldn't hurt, I 
guess.  Just curious: did you find that's a requirement to have a valid JSON 
object in the response sent to Prometheus even with HTTP 503?


http://gerrit.cloudera.org:8080/#/c/21723/10/src/kudu/master/master_path_handlers.cc@988
PS10, Line 988:   if (!l.leader_status().ok()) {
              :     WriteEmptyPrometheusSDResponse(output);
If we are sending back and empty list with HTTP 200 when a particular instance 
of catalog manager isn't a leader, how to avoid situations when Prometheus is 
getting an empty list from all the masters while they are in the process of 
electing a new leader?  Such interval is usually very short (in the order of 1s 
with default Raft heartbeat interval  of 500ms), but it's present.  With that, 
we have a risk of emptying the list of Prometheus metric sources for a Kudu 
cluster just out of the blue for the duration of SD discovery refresh interval 
at least.

Does it look like an issue, and if yes, maybe responding with HTTP 503 or using 
the 'redirect-to-the-leader-master' technique is a safer bet in this context?  
What do you think?



--
To view, visit http://gerrit.cloudera.org:8080/21723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I931aa72a7567c0dde43d7b7ed53a2dd0fa8bc1fe
Gerrit-Change-Number: 21723
Gerrit-PatchSet: 10
Gerrit-Owner: Marton Greber <[email protected]>
Gerrit-Reviewer: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Attila Bukor <[email protected]>
Gerrit-Reviewer: Gabriella Lotz <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Marton Greber <[email protected]>
Gerrit-Reviewer: Wang Xixu <[email protected]>
Gerrit-Reviewer: Zoltan Chovan <[email protected]>
Gerrit-Reviewer: Zoltan Martonka <[email protected]>
Gerrit-Comment-Date: Tue, 05 Aug 2025 16:43:57 +0000
Gerrit-HasComments: Yes

Reply via email to