Re: [PR] Log slow controller events [kafka]

2024-03-28 Thread via GitHub


soarez commented on PR #15622:
URL: https://github.com/apache/kafka/pull/15622#issuecomment-2026356492

   > events have relatively unique names (we include offset in some) so the 
cardinality is quite high
   
   I see. What I meant would result in too many more metrics. Makes sense then.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] Log slow controller events [kafka]

2024-03-28 Thread via GitHub


mumrah commented on code in PR #15622:
URL: https://github.com/apache/kafka/pull/15622#discussion_r1543844228


##
metadata/src/main/java/org/apache/kafka/controller/metrics/SlowEventsLogger.java:
##
@@ -0,0 +1,58 @@
+package org.apache.kafka.controller.metrics;
+
+import org.apache.kafka.common.utils.Time;
+import org.apache.kafka.common.utils.Timer;
+import org.slf4j.Logger;
+
+import java.util.function.Supplier;
+
+import static java.util.concurrent.TimeUnit.MICROSECONDS;
+import static java.util.concurrent.TimeUnit.MILLISECONDS;
+import static java.util.concurrent.TimeUnit.NANOSECONDS;
+
+public class SlowEventsLogger {
+/**
+ * Don't report any p99 events below this threshold. This prevents the 
controller from reporting p99 event
+ * times in the idle case.
+ */
+private static final int MIN_SLOW_EVENT_TIME_MS = 100;
+
+/**
+ * Calculating the p99 from the histogram consumes some resources, so only 
update it every so often.
+ */
+private static final int P99_REFRESH_INTERVAL_MS = 3;
+
+private final Supplier p99Supplier;
+private final Logger log;
+private final Time time;
+private double p99;
+private Timer percentileUpdateTimer;
+
+public SlowEventsLogger(
+Supplier p99Supplier,
+Logger log,
+Time time
+) {
+this.p99Supplier = p99Supplier;
+this.log = log;
+this.time = time;
+this.percentileUpdateTimer = time.timer(P99_REFRESH_INTERVAL_MS);
+}
+
+public void maybeLog(String name, long durationNs) {
+long durationMs = MILLISECONDS.convert(durationNs, NANOSECONDS);
+if (durationMs > MIN_SLOW_EVENT_TIME_MS && durationMs > p99) {

Review Comment:
   Yea, i need to initialize it in the constructor 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] Log slow controller events [kafka]

2024-03-28 Thread via GitHub


mumrah commented on PR #15622:
URL: https://github.com/apache/kafka/pull/15622#issuecomment-2026321417

   @soarez wdym by tagging here? 
   
   Note that the events have relatively unique names (we include offset in 
some) so the cardinality is quite high.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] Log slow controller events [kafka]

2024-03-28 Thread via GitHub


soarez commented on code in PR #15622:
URL: https://github.com/apache/kafka/pull/15622#discussion_r1543829135


##
metadata/src/main/java/org/apache/kafka/controller/metrics/SlowEventsLogger.java:
##
@@ -0,0 +1,58 @@
+package org.apache.kafka.controller.metrics;
+
+import org.apache.kafka.common.utils.Time;
+import org.apache.kafka.common.utils.Timer;
+import org.slf4j.Logger;
+
+import java.util.function.Supplier;
+
+import static java.util.concurrent.TimeUnit.MICROSECONDS;
+import static java.util.concurrent.TimeUnit.MILLISECONDS;
+import static java.util.concurrent.TimeUnit.NANOSECONDS;
+
+public class SlowEventsLogger {
+/**
+ * Don't report any p99 events below this threshold. This prevents the 
controller from reporting p99 event
+ * times in the idle case.
+ */
+private static final int MIN_SLOW_EVENT_TIME_MS = 100;
+
+/**
+ * Calculating the p99 from the histogram consumes some resources, so only 
update it every so often.
+ */
+private static final int P99_REFRESH_INTERVAL_MS = 3;
+
+private final Supplier p99Supplier;
+private final Logger log;
+private final Time time;
+private double p99;
+private Timer percentileUpdateTimer;
+
+public SlowEventsLogger(
+Supplier p99Supplier,
+Logger log,
+Time time
+) {
+this.p99Supplier = p99Supplier;
+this.log = log;
+this.time = time;
+this.percentileUpdateTimer = time.timer(P99_REFRESH_INTERVAL_MS);
+}
+
+public void maybeLog(String name, long durationNs) {
+long durationMs = MILLISECONDS.convert(durationNs, NANOSECONDS);
+if (durationMs > MIN_SLOW_EVENT_TIME_MS && durationMs > p99) {

Review Comment:
   It seems possible that `p99 = p99Supplier.get();` may not have run yet at 
this point. Should `p99` have an explicit default value? 



##
metadata/src/main/java/org/apache/kafka/controller/metrics/QuorumControllerMetrics.java:
##
@@ -28,6 +28,8 @@
 import java.util.Optional;
 import java.util.concurrent.atomic.AtomicLong;
 import java.util.function.Consumer;
+import java.util.function.Function;
+import java.util.function.Supplier;

Review Comment:
   Is `Supplier` needed here?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org