[ 
https://issues.apache.org/jira/browse/KAFKA-19888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eliel Gomes updated KAFKA-19888:
--------------------------------
    Description: 
h2. Summary

In Kafka 4.1.0, a broker entered a corrupted state where the Coordinator began 
producing {*}negative histogram values{*}, causing persistent write timeouts to 
{{{}__consumer_offsets{}}}, continuous group coordinator instability, and 
repeated consumer disconnections.

The issue did *not* self-heal after time synchronization stabilized and only 
cleared after restarting the affected broker.

This suggests that the Coordinator histogram/timer subsystem becomes invalid 
after certain time irregularities and does not recover automatically.

 
----
h2. *Description*

We observed repeated Coordinator failures on a single broker in a production 
Kafka 4.1.0 cluster. The internal histogram used by the Coordinator runtime 
began to record {*}negative values{*}, which should be impossible under correct 
operation.

Once this occurs:
 * Coordinator write events begin to time out repeatedly

 * Consumers are disconnected and reconnected frequently

 * Group coordination is unstable

 * Offset commits fail on {{__consumer_offsets}}

 * Consumer lag grows ("fake" lag) because commits keep failing

 * The broker remains in a broken state even after the system clock stabilizes

 * Only restarting the broker clears the condition

 * After the restart, consumer groups immediately select the same broker as 
coordinator again, and the problem eventually reappears

This behavior suggests state corruption inside the Coordinator's 
timer/histogram logic, where the {{GroupCoordinator}} component is susceptible 
to a fatal {{ArrayIndexOutOfBoundsException}} when attempting to record a 
negative value in an internal histogram.

This bug appears to be triggered by a non-monotonic clock adjustment (i.e., the 
system clock moving backward).

When this exception occurs, the {{group-coordinator-event-processor}} thread 
fails, specifically when processing internal events such as 
{{HighWatermarkUpdate}}.

These failures then surface as offset *commit failures* on the {*}consumers{*}.
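
The suspected mechanism can be illustrated in isolation. The sketch below is 
hypothetical (not Kafka's actual code; the class and variable names are 
invented): a backward step of the wall clock makes an event's queue-time delta 
negative, and an HdrHistogram-backed recorder rejects negative values with 
exactly the exception seen in the broker logs.
{code:java}
// Hypothetical sketch of the suspected failure mechanism; names invented.
public class NegativeDeltaSketch {
    public static void main(String[] args) {
        // Wall-clock reading (ms) when the event was enqueued.
        long enqueueTimeMs = 1_700_000_000_500L;
        // Wall-clock reading at dequeue, after the clock stepped back 400 ms.
        long dequeueTimeMs = 1_700_000_000_100L;

        long queueTimeMs = dequeueTimeMs - enqueueTimeMs; // -400
        System.out.println("queue time delta = " + queueTimeMs + " ms");

        // HdrHistogram rejects negative values by throwing
        // ArrayIndexOutOfBoundsException("Histogram recorded value cannot
        // be negative."), matching the stack traces below.
        if (queueTimeMs < 0) {
            System.out.println("histogram record would throw");
        }
    }
}
{code}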

 
----
h2. *Actual Behavior*
 * Coordinator internal histograms observe *negative values*

 * The Coordinator *fails to write to {{__consumer_offsets}}*

 * Consumers experience recurring instability

 * Kafka does not self-recover even when system time becomes stable

 * Only a broker restart resolves the condition

 
----
h2. *Expected behavior*

Kafka should:
 * Reject or guard against negative deltas in histogram/timer calculations

 * Prevent a corrupted Coordinator state from persisting

 * Avoid selecting a malfunctioning broker as group coordinator
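
As a minimal illustration of the first point, a guard could clamp the delta 
before it reaches the histogram. This is a hypothetical sketch, not Kafka's 
actual implementation; {{sanitize}} is an invented name.
{code:java}
// Hypothetical guard: clamp elapsed-time deltas to zero before recording,
// so a backward clock step cannot crash the event-processor thread.
// Not Kafka's actual code.
public final class SafeDelta {
    static long sanitize(long deltaMs) {
        return Math.max(0L, deltaMs);
    }

    public static void main(String[] args) {
        System.out.println(sanitize(-400L)); // 0
        System.out.println(sanitize(12L));   // 12
    }
}
{code}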

 
----
h2. *Impact*
 * Critical production consumer instability

 * Continuous rebalances

 * Offset commit delays

 * "Fake" lag caused by failed offset commits

 * Persistent coordinator malfunction

 * Requires broker restarts to recover

 * High operational risk

----
h2. *LOGS*

*Server-Side:* The {{GroupCoordinator}} broker log is flooded with the 
following exception, indicating the event processor thread is failing:

 
*Kafka Log - 2025-11-12 at 14:56*
{code:java}
[2025-11-12 14:56:38,166] ERROR [group-coordinator-event-processor-0]: Failed 
to run event InternalEvent(name=HighWatermarkUpdate) due to: Histogram recorded 
value cannot be negative.. 
(org.apache.kafka.coordinator.common.runtime.MultiThreadedEventProcessor$EventProcessorThread)
java.lang.ArrayIndexOutOfBoundsException: Histogram recorded value cannot be 
negative.
[2025-11-12 14:56:38,167] ERROR [GroupCoordinator id=5] Execution of 
HighWatermarkUpdate failed due to Histogram recorded value cannot be negative.. 
(org.apache.kafka.coordinator.common.runtime.CoordinatorRuntime)
java.lang.ArrayIndexOutOfBoundsException: Histogram recorded value cannot be 
negative.
...
[2025-11-12 14:58:38,170] ERROR [group-coordinator-event-processor-2]: Failed 
to run event InternalEvent(name=FlushBatch) due to: Histogram recorded value 
cannot be negative.. 
(org.apache.kafka.coordinator.common.runtime.MultiThreadedEventProcessor$EventProcessorThread)
java.lang.ArrayIndexOutOfBoundsException: Histogram recorded value cannot be 
negative.
[2025-11-12 14:58:38,170] ERROR [GroupCoordinator id=5] Execution of FlushBatch 
failed due to Histogram recorded value cannot be negative.. 
(org.apache.kafka.coordinator.common.runtime.CoordinatorRuntime)
java.lang.ArrayIndexOutOfBoundsException: Histogram recorded value cannot be 
negative.{code}
 

*Kafka Log - 2025-11-13 09:15*
{code:java}
[2025-11-13 09:15:39,777] ERROR [group-coordinator-event-processor-0]: Failed 
to run event InternalEvent(name=HighWatermarkUpdate) due to: Histogram recorded 
value cannot be negative.. 
(org.apache.kafka.coordinator.common.runtime.MultiThreadedEventProcessor$EventProcessorThread)
java.lang.ArrayIndexOutOfBoundsException: Histogram recorded value cannot be 
negative.
        at 
org.HdrHistogram.AbstractHistogram.countsArrayIndex(AbstractHistogram.java:2399)
 ~[HdrHistogram-2.2.2.jar:?]
        at 
org.HdrHistogram.AbstractHistogram.recordSingleValue(AbstractHistogram.java:559)
 ~[HdrHistogram-2.2.2.jar:?]
        at 
org.HdrHistogram.AbstractHistogram.recordValue(AbstractHistogram.java:467) 
~[HdrHistogram-2.2.2.jar:?]
        at org.HdrHistogram.Recorder.recordValue(Recorder.java:136) 
~[HdrHistogram-2.2.2.jar:?]
        at 
org.apache.kafka.coordinator.common.runtime.HdrHistogram.record(HdrHistogram.java:100)
 ~[kafka-coordinator-common-4.1.0.jar:?]
        at 
org.apache.kafka.coordinator.common.runtime.KafkaMetricHistogram.record(KafkaMetricHistogram.java:128)
 ~[kafka-coordinator-common-4.1.0.jar:?]
        at 
org.apache.kafka.common.metrics.Sensor.recordInternal(Sensor.java:236) 
~[kafka-clients-4.1.0.jar:?]
        at org.apache.kafka.common.metrics.Sensor.record(Sensor.java:197) 
~[kafka-clients-4.1.0.jar:?]
        at 
org.apache.kafka.coordinator.common.runtime.CoordinatorRuntimeMetricsImpl.recordEventQueueTime(CoordinatorRuntimeMetricsImpl.java:284)
 ~[kafka-coordinator-common-4.1.0.jar:?]
        at 
org.apache.kafka.coordinator.common.runtime.MultiThreadedEventProcessor$EventProcessorThread.handleEvents(MultiThreadedEventProcessor.java:146)
 [kafka-coordinator-common-4.1.0.jar:?]
        at 
org.apache.kafka.coordinator.common.runtime.MultiThreadedEventProcessor$EventProcessorThread.run(MultiThreadedEventProcessor.java:179)
 [kafka-coordinator-common-4.1.0.jar:?]
[2025-11-13 09:15:39,777] ERROR [GroupCoordinator id=5] Execution of 
HighWatermarkUpdate failed due to Histogram recorded value cannot be negative.. 
(org.apache.kafka.coordinator.common.runtime.CoordinatorRuntime)
java.lang.ArrayIndexOutOfBoundsException: Histogram recorded value cannot be 
negative.
        at 
org.HdrHistogram.AbstractHistogram.countsArrayIndex(AbstractHistogram.java:2399)
 ~[HdrHistogram-2.2.2.jar:?]
        at 
org.HdrHistogram.AbstractHistogram.recordSingleValue(AbstractHistogram.java:559)
 ~[HdrHistogram-2.2.2.jar:?]
        at 
org.HdrHistogram.AbstractHistogram.recordValue(AbstractHistogram.java:467) 
~[HdrHistogram-2.2.2.jar:?]
        at org.HdrHistogram.Recorder.recordValue(Recorder.java:136) 
~[HdrHistogram-2.2.2.jar:?]
        at 
org.apache.kafka.coordinator.common.runtime.HdrHistogram.record(HdrHistogram.java:100)
 ~[kafka-coordinator-common-4.1.0.jar:?]
        at 
org.apache.kafka.coordinator.common.runtime.KafkaMetricHistogram.record(KafkaMetricHistogram.java:128)
 ~[kafka-coordinator-common-4.1.0.jar:?]
        at 
org.apache.kafka.common.metrics.Sensor.recordInternal(Sensor.java:236) 
~[kafka-clients-4.1.0.jar:?]
        at org.apache.kafka.common.metrics.Sensor.record(Sensor.java:197) 
~[kafka-clients-4.1.0.jar:?]
        at 
org.apache.kafka.coordinator.common.runtime.CoordinatorRuntimeMetricsImpl.recordEventQueueTime(CoordinatorRuntimeMetricsImpl.java:284)
 ~[kafka-coordinator-common-4.1.0.jar:?]
        at 
org.apache.kafka.coordinator.common.runtime.MultiThreadedEventProcessor$EventProcessorThread.handleEvents(MultiThreadedEventProcessor.java:146)
 [kafka-coordinator-common-4.1.0.jar:?]
        at 
org.apache.kafka.coordinator.common.runtime.MultiThreadedEventProcessor$EventProcessorThread.run(MultiThreadedEventProcessor.java:179)
 [kafka-coordinator-common-4.1.0.jar:?] {code}
 

*Client-Side:* Consumers associated with this (failed) {{GroupCoordinator}} 
begin to fail their offset commits. The client logs show a continuous loop 
where the consumer:
 * Tries to commit and fails with {{The coordinator is not available.}}

 * Receives an error response {{COORDINATOR_NOT_AVAILABLE}}.

 * Disconnects and attempts rediscovery.

 * Discovers the _exact same_ (broken) broker as the coordinator and repeats 
the cycle.

 
*Consumer Log - 2025-11-12 at 14:58*
{code:java}
2025-11-13 14:58:46.730 WARN  [o.a.k.c.c.i.ConsumerCoordinator] [] - [Consumer 
clientId=CLIENT_ID_CONSUMER_LOW_PRIORITY, groupId=CONSUMER_GROUP_LOW_PRIORITY] 
Offset commit failed on partition SEND_SMS_LOW_PRIORITY-31 at offset 2291423: 
The coordinator is not available.
2025-11-13 14:58:46.730 INFO  [o.a.k.c.c.i.ConsumerCoordinator] [] - [Consumer 
clientId=CLIENT_ID_CONSUMER_LOW_PRIORITY, groupId=CONSUMER_GROUP_LOW_PRIORITY] 
Group coordinator X.X.X.X:9092 (id: 2147483642 rack: null) is unavailable or 
invalid due to cause: error response COORDINATOR_NOT_AVAILABLE. isDisconnected: 
false. Rediscovery will be attempted.
2025-11-12 14:58:46.388 INFO [o.a.k.c.c.i.ConsumerCoordinator] [] - [Consumer 
clientId=CLIENT_ID_CONSUMER_LOW_PRIORITY, groupId=CONSUMER_GROUP_LOW_PRIORITY] 
Requesting disconnect from last known coordinator 10.0.01:9092 (id: 2147483642 
rack: null) 2025-11-12 14:58:46.395 INFO [o.apache.kafka.clients.NetworkClient] 
[] - [Consumer clientId=CLIENT_ID_CONSUMER_LOW_PRIORITY, 
groupId=CONSUMER_GROUP_LOW_PRIORITY] Client requested disconnect from node 
2147483642 2025-11-12 14:58:46.396 INFO [o.apache.kafka.clients.NetworkClient] 
[] - [Consumer clientId=CLIENT_ID_CONSUMER_LOW_PRIORITY, 
groupId=CONSUMER_GROUP_LOW_PRIORITY] Cancelled in-flight OFFSET_COMMIT request 
with correlation id 3436277 due to node 2147483642 being disconnected (elapsed 
time since creation: 12ms, elapsed time since send: 12ms, throttle time: 0ms, 
request timeout: 30000ms) 2025-11-12 14:58:46.677 INFO 
[o.a.k.c.c.i.ConsumerCoordinator] [] - [Consumer 
clientId=CLIENT_ID_CONSUMER_LOW_PRIORITY, groupId=CONSUMER_GROUP_LOW_PRIORITY] 
Discovered group coordinator X.X.X.X:9092 (id: 2147483642 rack: null) {code}
 

*Consumer Log - 2025-11-13 at 09:15*
{code:java}
2025-11-13 09:15:46.730 WARN  [o.a.k.c.c.i.ConsumerCoordinator] [] - [Consumer 
clientId=CLIENT_ID_CONSUMER_LOW_PRIORITY, groupId=CONSUMER_GROUP_LOW_PRIORITY] 
Offset commit failed on partition SEND_SMS_LOW_PRIORITY-31 at offset 2291423: 
The coordinator is not available.
2025-11-13 09:15:46.730 INFO  [o.a.k.c.c.i.ConsumerCoordinator] [] - [Consumer 
clientId=CLIENT_ID_CONSUMER_LOW_PRIORITY, groupId=CONSUMER_GROUP_LOW_PRIORITY] 
Group coordinator X.X.X.X:9092 (id: 2147483642 rack: null) is unavailable or 
invalid due to cause: error response COORDINATOR_NOT_AVAILABLE. isDisconnected: 
false. Rediscovery will be attempted.
2025-11-13 09:15:46.730 INFO  [o.a.k.c.c.i.ConsumerCoordinator] [] - [Consumer 
clientId=CLIENT_ID_CONSUMER_LOW_PRIORITY, groupId=CONSUMER_GROUP_LOW_PRIORITY] 
Requesting disconnect from last known coordinator X.X.X.X:9092 (id: 2147483642 
rack: null)
2025-11-13 09:15:46.732 INFO  [o.apache.kafka.clients.NetworkClient] [] - 
[Consumer clientId=CLIENT_ID_CONSUMER_LOW_PRIORITY, 
groupId=CONSUMER_GROUP_LOW_PRIORITY] Client requested disconnect from node 
2147483642
2025-11-13 09:15:46.949 INFO  [o.a.k.c.c.i.ConsumerCoordinator] [] - [Consumer 
clientId=CLIENT_ID_CONSUMER_LOW_PRIORITY, groupId=CONSUMER_GROUP_LOW_PRIORITY] 
Discovered group coordinator X.X.X.X:9092 (id: 2147483642 rack: null)
2025-11-13 09:15:51.958 INFO  [o.a.k.c.c.i.ConsumerCoordinator] [] - [Consumer 
clientId=CLIENT_ID_CONSUMER_LOW_PRIORITY, groupId=CONSUMER_GROUP_LOW_PRIORITY] 
Group coordinator X.X.X.X:9092 (id: 2147483642 rack: null) is unavailable or 
invalid due to cause: error response COORDINATOR_NOT_AVAILABLE. isDisconnected: 
false. Rediscovery will be attempted.
2025-11-13 09:15:51.958 INFO  [o.a.k.c.c.i.ConsumerCoordinator] [] - [Consumer 
clientId=CLIENT_ID_CONSUMER_LOW_PRIORITY, groupId=CONSUMER_GROUP_LOW_PRIORITY] 
Requesting disconnect from last known coordinator X.X.X.X:9092 (id: 2147483642 
rack: null)
2025-11-13 09:15:51.958 INFO  [o.apache.kafka.clients.NetworkClient] [] - 
[Consumer clientId=CLIENT_ID_CONSUMER_LOW_PRIORITY, 
groupId=CONSUMER_GROUP_LOW_PRIORITY] Client requested disconnect from node 
2147483642
2025-11-13 09:15:51.958 INFO  [o.apache.kafka.clients.NetworkClient] [] - 
[Consumer clientId=CLIENT_ID_CONSUMER_LOW_PRIORITY, 
groupId=CONSUMER_GROUP_LOW_PRIORITY] Cancelled in-flight OFFSET_COMMIT request 
with correlation id 3557487 due to node 2147483642 being disconnected (elapsed 
time since creation: 4908ms, elapsed time since send: 4908ms, throttle time: 
0ms, request timeout: 30000ms)
2025-11-13 09:15:52.086 INFO  [SendAggregateTemplateRoute] [] - Grouped 2 
messages.
2025-11-13 09:15:52.132 INFO  [o.a.k.c.c.i.ConsumerCoordinator] [] - [Consumer 
clientId=CLIENT_ID_CONSUMER_LOW_PRIORITY, groupId=CONSUMER_GROUP_LOW_PRIORITY] 
Discovered group coordinator X.X.X.X:9092 (id: 2147483642 rack: null) {code}
 
----
 

*Workaround (Temporary Fix):*
 # Restart the affected broker to force a new {{GroupCoordinator}} election 
for the affected consumer groups.
 ## After the broker restarts, consumers will once again discover it as group 
coordinator, so the condition can eventually recur.

 
----
h2. *NTP*

Both the Kafka cluster brokers and the consumer clients run with the NTP 
service active:
{code:java}
timedatectl
               Local time: Fri 2025-11-14 11:07:49 -03
           Universal time: Fri 2025-11-14 14:07:49 UTC
                 RTC time: Fri 2025-11-14 14:07:49
                Time zone: America/Sao_Paulo (-03, -0300)
System clock synchronized: yes
              NTP service: active
          RTC in local TZ: no {code}
 
----
h2. *Request for Analysis / Questions*
 # Does Kafka 4.1.0 require strictly monotonic time for Coordinator 
histogram/timer logic?

 # Is histogram corruption (negative values) a known issue in 4.x?

 # Should Kafka validate histogram deltas to prevent negative values?

 # Is this fixed in newer Kafka releases?

 # Should CoordinatorRuntime reset its internal state when encountering time 
irregularities?

 # Is there a recommended workaround to avoid Coordinator entering this 
permanent error state?
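
Regarding question 1, the distinction at stake can be demonstrated with the 
two JDK time sources: {{System.currentTimeMillis()}} follows the wall clock 
and can step backward under NTP or hypervisor adjustments, while 
{{System.nanoTime()}} is monotonic. Measuring elapsed durations from the 
monotonic source would make negative deltas impossible by construction.
{code:java}
// Demonstrates the two JDK time sources relevant to question 1.
public class ClockSources {
    public static void main(String[] args) throws InterruptedException {
        long wallStart = System.currentTimeMillis(); // wall clock: can step backward
        long monoStart = System.nanoTime();          // monotonic: immune to clock steps

        Thread.sleep(10);

        // Could be negative after a backward clock step:
        long wallDeltaMs = System.currentTimeMillis() - wallStart;
        // Never negative, regardless of clock adjustments:
        long monoDeltaMs = (System.nanoTime() - monoStart) / 1_000_000;

        System.out.println("wall=" + wallDeltaMs + "ms mono=" + monoDeltaMs + "ms");
    }
}
{code}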

 

 

 


> Coordinator histogram negative values causing persistent write timeouts and 
> consumer instability in Kafka 4.1.0
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-19888
>                 URL: https://issues.apache.org/jira/browse/KAFKA-19888
>             Project: Kafka
>          Issue Type: Bug
>          Components: group-coordinator
>    Affects Versions: 4.1.0
>         Environment: Kafka cluster (KRaft) with 9 brokers
> RHEL-based OS (AlmaLinux 9.6) 
> Running on VMware vSphere
> NTP service active on VMs
> VMware Tools had "Synchronize time periodically" enabled
>            Reporter: Eliel Gomes
>            Priority: Critical
>
> h2. Summary
> In Kafka 4.1.0, a broker entered a corrupted state where the Coordinator 
> began producing {*}negative histogram values{*}, causing persistent write 
> timeouts to {{{}__consumer_offsets{}}}, continuous group coordinator 
> instability, and repeated consumer disconnections.
> The issue did *not* self-heal after time synchronization stabilized and only 
> cleared after restarting the affected broker.
> This suggests that the Coordinator histogram/timer subsystem becomes invalid 
> after certain time irregularities and does not recover automatically.
>  
> ----
> h2. *Description*
> We observed repeated Coordinator failures on a single broker in a production 
> Kafka 4.1.0 cluster. The internal histogram used by the Coordinator runtime 
> began to record {*}negative values{*}, which should be impossible under 
> correct operation.
> Once this occurs:
>  * Coordinator write events begin to timeout repeatedly
>  * Consumers are disconnected and reconnected frequently
>  * Group coordination is unstable
>  * Commits failing on __consumer_offsets
>  * Lag increasing ("fake" lag), because of failures in commits
>  * The broker remains in a broken state even after the system clock stabilizes
>  * Only restarting the broker clears the condition
>  * After the restart, consumer groups immediately select the same broker 
> again as coordinator, and the problem eventually reappears
> This behavior suggests state corruption inside the Coordinator's 
> timer/histogram logic, where the {{GroupCoordinator}} component is 
> susceptible to a fatal {{ArrayIndexOutOfBoundsException}} when attempting to 
> record a negative value in an internal histogram.
> This bug appears to be triggered by a non-monotonic clock adjustment (i.e., 
> the system clock moving backward).
> When this exception occurs, the {{group-coordinator-event-processor}} thread 
> fails, specifically when processing internal events like 
> {{{*}HighWatermarkUpdate{*}.}}
> The failures cause of offset *commit failures* from {*}consumers{*}.
>  
> ----
> h2. *Actual Behavior*
>  * Coordinator internal histograms observe *negative values*
>  * The Coordinator *fails to write to {{__consumer_offsets}}*
>  * Consumers experience recurring instability
>  * Kafka does not self-recover even when system time becomes stable
>  * Only a broker restart resolves the condition
>  
> ----
> h2. *Expected behavior*
> Kafka should:
>  * Reject or guard against negative deltas in histogram/timer calculations
>  * Prevent a corrupted Coordinator state from persisting
>  * Avoid selecting a malfunctioning broker as group coordinator
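> The first point could be met with a small guard at the record site. The sketch 
> below is illustrative only: the {{LongConsumer}} stand-in, class, and method 
> names are hypothetical, not Kafka's actual metrics API.

```java
import java.util.function.LongConsumer;

public class HistogramGuard {
    // Hypothetical defensive wrapper: clamp negative deltas to zero instead of
    // letting the underlying histogram throw ArrayIndexOutOfBoundsException
    // and kill the event-processor thread.
    static void recordNonNegative(LongConsumer histogram, long valueMs) {
        histogram.accept(Math.max(0L, valueMs));
    }

    public static void main(String[] args) {
        long[] recorded = new long[1];
        LongConsumer fakeHistogram = v -> recorded[0] = v;
        recordNonNegative(fakeHistogram, -2_000L); // delta after a backward clock step
        System.out.println(recorded[0]);           // clamped to 0, nothing thrown
    }
}
```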
>  
> ----
> h2. *Impact*
>  * Critical production consumer instability
>  * Continuous rebalances
>  * Offset commit delays
>  ** "Fake lags"
>  * Persistent coordinator malfunction
>  * Requires broker restarts to recover
>  * High operational risk
> ----
> h2. *LOGS*
> *Server-Side:* The {{GroupCoordinator}} broker log is flooded with the 
> following exception, indicating the event processor thread is failing:
>  
> *Kafka Log - 2025-11-12 at 14:56*
> {code:java}
> [2025-11-12 14:56:38,166] ERROR [group-coordinator-event-processor-0]: Failed 
> to run event InternalEvent(name=HighWatermarkUpdate) due to: Histogram 
> recorded value cannot be negative.. 
> (org.apache.kafka.coordinator.common.runtime.MultiThreadedEventProcessor$EventProcessorThread)
> java.lang.ArrayIndexOutOfBoundsException: Histogram recorded value cannot be 
> negative.
> [2025-11-12 14:56:38,167] ERROR [GroupCoordinator id=5] Execution of 
> HighWatermarkUpdate failed due to Histogram recorded value cannot be 
> negative.. (org.apache.kafka.coordinator.common.runtime.CoordinatorRuntime)
> java.lang.ArrayIndexOutOfBoundsException: Histogram recorded value cannot be 
> negative.
> ...
> [2025-11-12 14:58:38,170] ERROR [group-coordinator-event-processor-2]: Failed 
> to run event InternalEvent(name=FlushBatch) due to: Histogram recorded value 
> cannot be negative.. 
> (org.apache.kafka.coordinator.common.runtime.MultiThreadedEventProcessor$EventProcessorThread)
> java.lang.ArrayIndexOutOfBoundsException: Histogram recorded value cannot be 
> negative.
> [2025-11-12 14:58:38,170] ERROR [GroupCoordinator id=5] Execution of 
> FlushBatch failed due to Histogram recorded value cannot be negative.. 
> (org.apache.kafka.coordinator.common.runtime.CoordinatorRuntime)
> java.lang.ArrayIndexOutOfBoundsException: Histogram recorded value cannot be 
> negative.{code}
>  
> *Kafka Log - 2025-11-13 09:15*
> {code:java}
> [2025-11-13 09:15:39,777] ERROR [group-coordinator-event-processor-0]: Failed 
> to run event InternalEvent(name=HighWatermarkUpdate) due to: Histogram 
> recorded value cannot be negative.. 
> (org.apache.kafka.coordinator.common.runtime.MultiThreadedEventProcessor$EventProcessorThread)
> java.lang.ArrayIndexOutOfBoundsException: Histogram recorded value cannot be 
> negative.
>         at 
> org.HdrHistogram.AbstractHistogram.countsArrayIndex(AbstractHistogram.java:2399)
>  ~[HdrHistogram-2.2.2.jar:?]
>         at 
> org.HdrHistogram.AbstractHistogram.recordSingleValue(AbstractHistogram.java:559)
>  ~[HdrHistogram-2.2.2.jar:?]
>         at 
> org.HdrHistogram.AbstractHistogram.recordValue(AbstractHistogram.java:467) 
> ~[HdrHistogram-2.2.2.jar:?]
>         at org.HdrHistogram.Recorder.recordValue(Recorder.java:136) 
> ~[HdrHistogram-2.2.2.jar:?]
>         at 
> org.apache.kafka.coordinator.common.runtime.HdrHistogram.record(HdrHistogram.java:100)
>  ~[kafka-coordinator-common-4.1.0.jar:?]
>         at 
> org.apache.kafka.coordinator.common.runtime.KafkaMetricHistogram.record(KafkaMetricHistogram.java:128)
>  ~[kafka-coordinator-common-4.1.0.jar:?]
>         at 
> org.apache.kafka.common.metrics.Sensor.recordInternal(Sensor.java:236) 
> ~[kafka-clients-4.1.0.jar:?]
>         at org.apache.kafka.common.metrics.Sensor.record(Sensor.java:197) 
> ~[kafka-clients-4.1.0.jar:?]
>         at 
> org.apache.kafka.coordinator.common.runtime.CoordinatorRuntimeMetricsImpl.recordEventQueueTime(CoordinatorRuntimeMetricsImpl.java:284)
>  ~[kafka-coordinator-common-4.1.0.jar:?]
>         at 
> org.apache.kafka.coordinator.common.runtime.MultiThreadedEventProcessor$EventProcessorThread.handleEvents(MultiThreadedEventProcessor.java:146)
>  [kafka-coordinator-common-4.1.0.jar:?]
>         at 
> org.apache.kafka.coordinator.common.runtime.MultiThreadedEventProcessor$EventProcessorThread.run(MultiThreadedEventProcessor.java:179)
>  [kafka-coordinator-common-4.1.0.jar:?]
> [2025-11-13 09:15:39,777] ERROR [GroupCoordinator id=5] Execution of 
> HighWatermarkUpdate failed due to Histogram recorded value cannot be 
> negative.. (org.apache.kafka.coordinator.common.runtime.CoordinatorRuntime)
> java.lang.ArrayIndexOutOfBoundsException: Histogram recorded value cannot be 
> negative.
>         at 
> org.HdrHistogram.AbstractHistogram.countsArrayIndex(AbstractHistogram.java:2399)
>  ~[HdrHistogram-2.2.2.jar:?]
>         at 
> org.HdrHistogram.AbstractHistogram.recordSingleValue(AbstractHistogram.java:559)
>  ~[HdrHistogram-2.2.2.jar:?]
>         at 
> org.HdrHistogram.AbstractHistogram.recordValue(AbstractHistogram.java:467) 
> ~[HdrHistogram-2.2.2.jar:?]
>         at org.HdrHistogram.Recorder.recordValue(Recorder.java:136) 
> ~[HdrHistogram-2.2.2.jar:?]
>         at 
> org.apache.kafka.coordinator.common.runtime.HdrHistogram.record(HdrHistogram.java:100)
>  ~[kafka-coordinator-common-4.1.0.jar:?]
>         at 
> org.apache.kafka.coordinator.common.runtime.KafkaMetricHistogram.record(KafkaMetricHistogram.java:128)
>  ~[kafka-coordinator-common-4.1.0.jar:?]
>         at 
> org.apache.kafka.common.metrics.Sensor.recordInternal(Sensor.java:236) 
> ~[kafka-clients-4.1.0.jar:?]
>         at org.apache.kafka.common.metrics.Sensor.record(Sensor.java:197) 
> ~[kafka-clients-4.1.0.jar:?]
>         at 
> org.apache.kafka.coordinator.common.runtime.CoordinatorRuntimeMetricsImpl.recordEventQueueTime(CoordinatorRuntimeMetricsImpl.java:284)
>  ~[kafka-coordinator-common-4.1.0.jar:?]
>         at 
> org.apache.kafka.coordinator.common.runtime.MultiThreadedEventProcessor$EventProcessorThread.handleEvents(MultiThreadedEventProcessor.java:146)
>  [kafka-coordinator-common-4.1.0.jar:?]
>         at 
> org.apache.kafka.coordinator.common.runtime.MultiThreadedEventProcessor$EventProcessorThread.run(MultiThreadedEventProcessor.java:179)
>  [kafka-coordinator-common-4.1.0.jar:?] {code}
>  
> *Client-Side:* Consumers associated with this (failed) {{GroupCoordinator}} 
> begin to fail their offset commits. The client logs show a continuous loop 
> where the consumer:
>  * Tries to commit and fails with {{The coordinator is not available.}}
>  * Receives an error response {{{}COORDINATOR_NOT_AVAILABLE{}}}.
>  * Disconnects and attempts rediscovery.
>  * Discovers the _exact same_ (broken) broker as the coordinator and repeats 
> the cycle.
>  
> *Consumer Log - 2025-11-12 at 14:58*
> {code:java}
> 2025-11-13 14:58:46.730 WARN  [o.a.k.c.c.i.ConsumerCoordinator] [] - 
> [Consumer clientId=CLIENT_ID_CONSUMER_LOW_PRIORITY, 
> groupId=CONSUMER_GROUP_LOW_PRIORITY] Offset commit failed on partition 
> SEND_SMS_LOW_PRIORITY-31 at offset 2291423: The coordinator is not available.
> 2025-11-13 14:58:46.730 INFO  [o.a.k.c.c.i.ConsumerCoordinator] [] - 
> [Consumer clientId=CLIENT_ID_CONSUMER_LOW_PRIORITY, 
> groupId=CONSUMER_GROUP_LOW_PRIORITY] Group coordinator X.X.X.X:9092 (id: 
> 2147483642 rack: null) is unavailable or invalid due to cause: error response 
> COORDINATOR_NOT_AVAILABLE. isDisconnected: false. Rediscovery will be 
> attempted.
> 2025-11-12 14:58:46.388 INFO  [o.a.k.c.c.i.ConsumerCoordinator] [] - [Consumer 
> clientId=CLIENT_ID_CONSUMER_LOW_PRIORITY, 
> groupId=CONSUMER_GROUP_LOW_PRIORITY] Requesting disconnect from last known 
> coordinator 10.0.01:9092 (id: 2147483642 rack: null)
> 2025-11-12 14:58:46.395 INFO  [o.apache.kafka.clients.NetworkClient] [] - 
> [Consumer clientId=CLIENT_ID_CONSUMER_LOW_PRIORITY, 
> groupId=CONSUMER_GROUP_LOW_PRIORITY] Client requested disconnect from node 
> 2147483642
> 2025-11-12 14:58:46.396 INFO  [o.apache.kafka.clients.NetworkClient] [] - 
> [Consumer clientId=CLIENT_ID_CONSUMER_LOW_PRIORITY, 
> groupId=CONSUMER_GROUP_LOW_PRIORITY] Cancelled in-flight OFFSET_COMMIT 
> request with correlation id 3436277 due to node 2147483642 being disconnected 
> (elapsed time since creation: 12ms, elapsed time since send: 12ms, throttle 
> time: 0ms, request timeout: 30000ms)
> 2025-11-12 14:58:46.677 INFO  [o.a.k.c.c.i.ConsumerCoordinator] [] - 
> [Consumer clientId=CLIENT_ID_CONSUMER_LOW_PRIORITY, 
> groupId=CONSUMER_GROUP_LOW_PRIORITY] Discovered group coordinator 
> X.X.X.X:9092 (id: 2147483642 rack: null) {code}
>  
> *Consumer Log - 2025-11-13 at 09:15*
> {code:java}
> 2025-11-13 09:15:46.730 WARN  [o.a.k.c.c.i.ConsumerCoordinator] [] - 
> [Consumer clientId=CLIENT_ID_CONSUMER_LOW_PRIORITY, 
> groupId=CONSUMER_GROUP_LOW_PRIORITY] Offset commit failed on partition 
> SEND_SMS_LOW_PRIORITY-31 at offset 2291423: The coordinator is not available.
> 2025-11-13 09:15:46.730 INFO  [o.a.k.c.c.i.ConsumerCoordinator] [] - 
> [Consumer clientId=CLIENT_ID_CONSUMER_LOW_PRIORITY, 
> groupId=CONSUMER_GROUP_LOW_PRIORITY] Group coordinator X.X.X.X:9092 (id: 
> 2147483642 rack: null) is unavailable or invalid due to cause: error response 
> COORDINATOR_NOT_AVAILABLE. isDisconnected: false. Rediscovery will be 
> attempted.
> 2025-11-13 09:15:46.730 INFO  [o.a.k.c.c.i.ConsumerCoordinator] [] - 
> [Consumer clientId=CLIENT_ID_CONSUMER_LOW_PRIORITY, 
> groupId=CONSUMER_GROUP_LOW_PRIORITY] Requesting disconnect from last known 
> coordinator X.X.X.X:9092 (id: 2147483642 rack: null)
> 2025-11-13 09:15:46.732 INFO  [o.apache.kafka.clients.NetworkClient] [] - 
> [Consumer clientId=CLIENT_ID_CONSUMER_LOW_PRIORITY, 
> groupId=CONSUMER_GROUP_LOW_PRIORITY] Client requested disconnect from node 
> 2147483642
> 2025-11-13 09:15:46.949 INFO  [o.a.k.c.c.i.ConsumerCoordinator] [] - 
> [Consumer clientId=CLIENT_ID_CONSUMER_LOW_PRIORITY, 
> groupId=CONSUMER_GROUP_LOW_PRIORITY] Discovered group coordinator 
> X.X.X.X:9092 (id: 2147483642 rack: null)
> 2025-11-13 09:15:51.958 INFO  [o.a.k.c.c.i.ConsumerCoordinator] [] - 
> [Consumer clientId=CLIENT_ID_CONSUMER_LOW_PRIORITY, 
> groupId=CONSUMER_GROUP_LOW_PRIORITY] Group coordinator X.X.X.X:9092 (id: 
> 2147483642 rack: null) is unavailable or invalid due to cause: error response 
> COORDINATOR_NOT_AVAILABLE. isDisconnected: false. Rediscovery will be 
> attempted.
> 2025-11-13 09:15:51.958 INFO  [o.a.k.c.c.i.ConsumerCoordinator] [] - 
> [Consumer clientId=CLIENT_ID_CONSUMER_LOW_PRIORITY, 
> groupId=CONSUMER_GROUP_LOW_PRIORITY] Requesting disconnect from last known 
> coordinator X.X.X.X:9092 (id: 2147483642 rack: null)
> 2025-11-13 09:15:51.958 INFO  [o.apache.kafka.clients.NetworkClient] [] - 
> [Consumer clientId=CLIENT_ID_CONSUMER_LOW_PRIORITY, 
> groupId=CONSUMER_GROUP_LOW_PRIORITY] Client requested disconnect from node 
> 2147483642
> 2025-11-13 09:15:51.958 INFO  [o.apache.kafka.clients.NetworkClient] [] - 
> [Consumer clientId=CLIENT_ID_CONSUMER_LOW_PRIORITY, 
> groupId=CONSUMER_GROUP_LOW_PRIORITY] Cancelled in-flight OFFSET_COMMIT 
> request with correlation id 3557487 due to node 2147483642 being disconnected 
> (elapsed time since creation: 4908ms, elapsed time since send: 4908ms, 
> throttle time: 0ms, request timeout: 30000ms)
> 2025-11-13 09:15:52.086 INFO  [SendAggregateTemplateRoute] [] - Grouped 2 
> messages.
> 2025-11-13 09:15:52.132 INFO  [o.a.k.c.c.i.ConsumerCoordinator] [] - 
> [Consumer clientId=CLIENT_ID_CONSUMER_LOW_PRIORITY, 
> groupId=CONSUMER_GROUP_LOW_PRIORITY] Discovered group coordinator 
> X.X.X.X:9092 (id: 2147483642 rack: null) {code}
>  
> ----
>  
> *Workaround (Temporary Fix):*
>  # Restart the affected broker to force a new {{GroupCoordinator}} election 
> for the affected consumer groups.
>  ## Note: after the broker restarts, consumers will once again discover it as 
> Group Coordinator, so the problem may eventually reappear.
>  
> ----
> h2. *NTP*
> Both the Kafka cluster brokers and the consumer clients have the NTP service active:
> {code:java}
> timedatectl
>                Local time: Fri 2025-11-14 11:07:49 -03
>            Universal time: Fri 2025-11-14 14:07:49 UTC
>                  RTC time: Fri 2025-11-14 14:07:49
>                 Time zone: America/Sao_Paulo (-03, -0300)
> System clock synchronized: yes
>               NTP service: active
>           RTC in local TZ: no {code}
>  
> ----
> h2. *Request for Analysis / Questions*
>  # Does Kafka 4.1.0 require strictly monotonic time for Coordinator 
> histogram/timer logic?
>  # Is histogram corruption (negative values) a known issue in 4.x?
>  # Should Kafka validate histogram deltas to prevent negative values?
>  # Is this fixed in newer Kafka releases?
>  # Should CoordinatorRuntime reset its internal state when encountering time 
> irregularities?
>  # Is there a recommended workaround to avoid Coordinator entering this 
> permanent error state?
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
