[ 
https://issues.apache.org/jira/browse/KAFKA-17954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Telford updated KAFKA-17954:
-------------------------------------
    Description: 
In 
[KIP-989|https://cwiki.apache.org/confluence/display/KAFKA/KIP-989%3A+Improved+StateStore+Iterator+metrics+for+detecting+leaks]
 we introduced a new metric, {{{}oldest-iterator-open-since-ms{}}}, which 
reports the timestamp that the oldest currently open KeyValueIterator was 
opened at.

On-scrape, we sometimes see this {{WARN}} log message:
{noformat}
Error getting JMX attribute 'oldest-iterator-open-since-ms'
java.util.NoSuchElementException
at 
java.base/java.util.concurrent.ConcurrentSkipListMap.firstKey(ConcurrentSkipListMap.java:1859)
at 
java.base/java.util.concurrent.ConcurrentSkipListSet.first(ConcurrentSkipListSet.java:396)
at 
org.apache.kafka.streams.state.internals.MeteredKeyValueStore.lambda$registerMetrics$5(MeteredKeyValueStore.java:179){noformat}
-However, if no iterators are currently open, this Gauge returns {{{}null{}}}.-

-When using the Prometheus {{JmxScraper}} to scrape this metric, its value is 
added to a {{{}ConcurrentHashMap{}}}, which does _not_ permit {{null}} values.-

-We should find some other way to report the absence of this metric that does 
not cause problems with {{{}ConcurrentHashMap{}}}.-

My initial analysis was incorrect. The problem appears to be caused by the 
{{openIterators}} Set in {{{}MeteredKeyValueStore{}}}:
{noformat}
    protected NavigableSet<MeteredIterator> openIterators = new 
ConcurrentSkipListSet<>(Comparator.comparingLong(MeteredIterator::startTimestamp));
 {noformat}
This is used by the Gauge to report the metric:
{noformat}
openIterators.isEmpty() ? null : openIterators.first().startTimestamp() 
{noformat}
The source of the exception is the right-hand side of this ternary expression, 
specifically {{{}openIterators.first(){}}}.

The condition of this expression should ensure that there is at least one 
element to retrieve by the right-hand side. *However, if the last Iterator is 
removed from this Set concurrently to the Gauge being reported, after the 
emptiness check, but before retrieving the element, we can throw the above 
exception here.*

This can happen because interactive queries and stream threads operate 
concurrently from the thread that reads the Gauge to report metrics.

  was:
In 
[KIP-989|https://cwiki.apache.org/confluence/display/KAFKA/KIP-989%3A+Improved+StateStore+Iterator+metrics+for+detecting+leaks]
 we introduced a new metric, {{{}oldest-iterator-open-since-ms{}}}, which 
reports the timestamp that the oldest currently open KeyValueIterator was 
opened at.

However, if no iterators are currently open, this Gauge returns {{{}null{}}}.

When using the Prometheus {{JmxScraper}} to scrape this metric, its value is 
added to a {{{}ConcurrentHashMap{}}}, which does _not_ permit {{null}} values.

Consequently, on-scrape, we see this {{WARN}} log message:
{noformat}
Error getting JMX attribute 'oldest-iterator-open-since-ms' {noformat}
We should find some other way to report the absence of this metric that does 
not cause problems with {{{}ConcurrentHashMap{}}}.


> Error getting oldest-iterator-open-since-ms from JMX
> ----------------------------------------------------
>
>                 Key: KAFKA-17954
>                 URL: https://issues.apache.org/jira/browse/KAFKA-17954
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>    Affects Versions: 3.8.1
>            Reporter: Nicholas Telford
>            Assignee: Nicholas Telford
>            Priority: Minor
>
> In 
> [KIP-989|https://cwiki.apache.org/confluence/display/KAFKA/KIP-989%3A+Improved+StateStore+Iterator+metrics+for+detecting+leaks]
>  we introduced a new metric, {{{}oldest-iterator-open-since-ms{}}}, which 
> reports the timestamp that the oldest currently open KeyValueIterator was 
> opened at.
> On-scrape, we sometimes see this {{WARN}} log message:
> {noformat}
> Error getting JMX attribute 'oldest-iterator-open-since-ms'
> java.util.NoSuchElementException
> at 
> java.base/java.util.concurrent.ConcurrentSkipListMap.firstKey(ConcurrentSkipListMap.java:1859)
> at 
> java.base/java.util.concurrent.ConcurrentSkipListSet.first(ConcurrentSkipListSet.java:396)
> at 
> org.apache.kafka.streams.state.internals.MeteredKeyValueStore.lambda$registerMetrics$5(MeteredKeyValueStore.java:179){noformat}
> -However, if no iterators are currently open, this Gauge returns 
> {{{}null{}}}.-
> -When using the Prometheus {{JmxScraper}} to scrape this metric, its value is 
> added to a {{{}ConcurrentHashMap{}}}, which does _not_ permit {{null}} 
> values.-
> -We should find some other way to report the absence of this metric that does 
> not cause problems with {{{}ConcurrentHashMap{}}}.-
> My initial analysis was incorrect. The problem appears to be caused by the 
> {{openIterators}} Set in {{{}MeteredKeyValueStore{}}}:
> {noformat}
>     protected NavigableSet<MeteredIterator> openIterators = new 
> ConcurrentSkipListSet<>(Comparator.comparingLong(MeteredIterator::startTimestamp));
>  {noformat}
> This is used by the Gauge to report the metric:
> {noformat}
> openIterators.isEmpty() ? null : openIterators.first().startTimestamp() 
> {noformat}
> The source of the exception is the right-hand side of this ternary 
> expression, specifically {{{}openIterators.first(){}}}.
> The condition of this expression should ensure that there is at least one 
> element to retrieve by the right-hand side. *However, if the last Iterator is 
> removed from this Set concurrently to the Gauge being reported, after the 
> emptiness check, but before retrieving the element, we can throw the above 
> exception here.*
> This can happen because interactive queries and stream threads operate 
> concurrently from the thread that reads the Gauge to report metrics.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to