Re: [DISCUSS] KIP-989: RocksDB Iterator Metrics

Nick Telford Thu, 28 Mar 2024 06:53:43 -0700

Quick addendum:

My suggested metric "oldest-open-iterator-age-seconds" should be
"oldest-open-iterator-age-ms". Milliseconds is obviously a better
granularity for such a metric.


Still accepting suggestions for a better name.

On Thu, 28 Mar 2024 at 13:41, Nick Telford <nick.telf...@gmail.com> wrote:

> Hi everyone,
>
> Sorry for leaving this for so long. So much for "3 weeks until KIP freeze"!
>
> On Sophie's comments:
> 1. Would Matthias's suggestion of a separate metric tracking the age of
> the oldest open iterator (within the tag set) satisfy this? That way we can
> keep iterator-duration-(avg|max) for closed iterators, which can be useful
> for performance debugging for iterators that don't leak. I'm not sure what
> we'd call this metric, maybe: "oldest-open-iterator-age-seconds"? Seems
> like a mouthful.
>
> 2. You're right, it makes more sense to provide
> iterator-duration-(avg|max). Honestly, I can't remember why I had "total"
> before, or why I was computing a rate-of-change over it.
>
> 3, 4, 5, 6. Agreed, I'll make all those changes as suggested.
>
> 7. Combined with Matthias's point about RocksDB, I'm convinced that this
> is the wrong KIP for these. I'll introduce the additional Rocks metrics in
> another KIP.
>
> On Matthias's comments:
> A. Not sure about the time window. I'm pretty sure all existing avg/max
> metrics are since the application was started? Any other suggestions here
> would be appreciated.
>
> B. Agreed. See point 1 above.
>
> C. Good point. My focus was very much on Rocks memory leaks when I wrote
> the first draft. I can generalise it. My only concern is that it might make
> it more difficult to detect Rocks iterator leaks caused *within* our
> high-level iterator, e.g. RocksJNI, RocksDB, RocksDBStore, etc. But we
> could always provide a RocksDB-specific metric for this, as you suggested.
>
> D. Hmm, we do already have MeteredKeyValueIterator, which automatically
> wraps the iterator from inner-stores of MeteredKeyValueStore. If we
> implemented these metrics there, then custom stores would automatically
> gain the functionality, right? This seems like a pretty logical place to
> implement these metrics, since MeteredKeyValueStore is all about adding
> metrics to state stores.
>
> > I imagine the best way to implement this would be to do so at the
> > high-level iterator rather than implementing it separately for each
> > specific iterator implementation for every store type.
>
> Sophie, does MeteredKeyValueIterator fit with your recommendation?
>
> Thanks for your thoughts everyone, I'll update the KIP now.
>
> Nick
>
> On Thu, 14 Mar 2024 at 03:37, Sophie Blee-Goldman <sop...@responsive.dev>
> wrote:
>
>> About your last two points: I completely agree that we should try to
>> make this independent of RocksDB, and should probably adopt a
>> general philosophy of being store-implementation agnostic unless
>> there is good reason to focus on a particular store type: eg if it was
>> only possible to implement for certain stores, or only made sense in
>> the context of a certain store type but not necessarily stores in general.
>>
>> While leaking memory due to unclosed iterators on RocksDB stores is
>> certainly the most common issue, I think Matthias sufficiently
>> demonstrated that the problem of leaking iterators is not actually
>> unique to RocksDB, and we should consider including in-memory
>> stores at the very least. I also think that at this point, we may as well
>> just implement the metrics for *all* store types, whether rocksdb or
>> in-memory or custom. Not just because it probably applies to all
>> store types (leaking iterators are rarely a good thing!) but because
>> I imagine the best way to implement this would be to do so at the
>> high-level iterator rather than implementing it separately for each
>> specific iterator implementation for every store type.
>>
>> That said, I haven't thought all that carefully about the implementation
>> yet -- it just strikes me as easiest to do at the top level of the store
>> hierarchy rather than at the bottom. My gut instinct may very well be
>> wrong, but that's what it's saying
>>
>> On Thu, Mar 7, 2024 at 10:43 AM Matthias J. Sax <mj...@apache.org> wrote:
>>
>> > Seems I am late to this party. Can we pick this up again aiming for 3.8
>> > release? I think it would be a great addition. Few comments:
>> >
>> >
>> > - I think it does make sense to report `iterator-duration-avg` and
>> > `iterator-duration-max` for all *closed* iterators -- it just seems to
>> > be a useful metric (wondering if this would be _overall_ or bounded to
>> > some time window?)
>> >
>> > - About the duration iterators are currently open, I believe the only
>> > useful way is to report the "oldest iterator", ie, the smallest iterator
>> > open-time, of all currently open-iterator? We all agree that in general,
>> > leaking iterator would bump the count metric, and if there is a few ones
>> > which are not closed and open for a long time, it seem sufficient to
>> > detect the single oldest one for alerting purpose?
>> >
>> > - What I don't like about the KIP is it focus on RocksDB. I don't think
>> > we should build on the internal RocksDB counters as proposed (I guess,
>> > we could still expose them, similar to other RocksDB metrics which we
>> > expose already). However, for this new metric, we should track it
>> > ourselves and thus make it independent of RocksDB -- in the end, an
>> > in-memory store could also leak memory (and kill a JVM with an
>> > out-of-memory error) and we should be able to track it.
>> >
>> > - Not sure if we would like to add support for custom stores, to allow
>> > them to register their iterators with this metric? Or would this not be
>> > necessary, because custom stores could just register a custom metric
>> > about it to begin with?
>> >
>> >
>> >
>> > -Matthias
>> >
>> > On 10/25/23 4:41 PM, Sophie Blee-Goldman wrote:
>> > >>
>> > >>   If we used "iterator-duration-max", for
>> > >> example, would it not be confusing that it includes Iterators that
>> are
>> > >> still open, and therefore the duration is not yet known?
>> > >
>> > >
>> > > 1. Ah, I think I understand your concern better now -- I totally agree
>> > that
>> > > a
>> > >   "iterator-duration-max" metric would be confusing/misleading. I was
>> > > thinking about it a bit differently, something more akin to the
>> > > "last-rebalance-seconds-ago" consumer metric. As the name suggests,
>> > > that basically just tracks how long the consumer has gone without
>> > > rebalancing -- it doesn't purport to represent the actual duration
>> > between
>> > > rebalances, just the current time since the last one.  The hard part
>> is
>> > > really
>> > > in choosing a name that reflects this -- maybe you have some better
>> ideas
>> > > but off the top of my head, perhaps something like
>> > "iterator-lifetime-max"?
>> > >
>> > > 2. I'm not quite sure how to interpret the "iterator-duration-total"
>> > metric
>> > > -- what exactly does it mean to add up all the iterator durations? For
>> > > some context, while this is not a hard-and-fast rule, in general
>> you'll
>> > > find that Kafka/Streams metrics tend to come in pairs of avg/max or
>> > > rate/total. Something that you might measure the avg for usually is
>> > > also useful to measure the max, whereas a total metric is probably
>> > > also useful as a rate but not so much as an avg. I actually think this
>> > > is part of why it feels like it makes so much sense to include a "max"
>> > > version of this metric, as Lucas suggested, even if the name of
>> > > "iterator-duration-max" feels misleading. Ultimately the metric names
>> > > are up to you, but for this reason, I would personally advocate for
>> > > just going with an "iterator-duration-avg" and "iterator-duration-max"
>> > >
>> > > I did see your example in which you mention one could monitor the
>> > > rate of change of the "-total" metric. While this does make sense to
>> > > me, if the only way to interpret a metric is by computing another
>> > > function over it, then why not just make that computation the metric
>> > > and cut out the middle man? And in this case, to me at least, it feels
>> > > much easier to understand a metric like "iterator-duration-max" vs
>> > > something like "iterator-duration-total-rate"
>> > >
>> > > 3. By the way, can you add another column to the table with the new
>> > metrics
>> > > that lists the recording level? My suggestion would be to put the
>> > > "number-open-iterators" at INFO and the other two at DEBUG. See
>> > > the following for my reasoning behind this recommendation
>> > >
>> > > 4. I would change the "Type" entry for the "number-open-iterators"
>> from
>> > > "Value" to "Gauge". This helps justify the "INFO" level for this
>> metric,
>> > > since unlike the other metrics which are "Measurables", the current
>> > > timestamp won't need to be retrieved on each recording
>> > >
>> > > 5. Can you list the tags that would be associated with each of these
>> > > metrics (either in the table, or separately above/below if they will
>> be
>> > > the same for all)
>> > >
>> > > 6. Do you have a strong preference for the name
>> "number-open-iterators"
>> > > or would you be alright in shortening this to "num-open-iterators"?
>> The
>> > > latter is more in line with the naming scheme used elsewhere in Kafka
>> > > for similar kinds of metrics, and a shorter name is always nice.
>> > >
>> > > 7. With respect to the rocksdb cache metrics, those sound useful but
>> > > if it was me, I would probably save them for a separate KIP mainly
>> just
>> > > because the KIP freeze deadline is in a few weeks, and I wouldn't want
>> > > to end up blocking all the new metrics just because there was ongoing
>> > > debate about a subset of them. That said, you do have 3 full weeks, so
>> > > I would hope that you could get both sets of metrics agreed upon in
>> > > that timeframe!
>> > >
>> > >
>> > > On Tue, Oct 24, 2023 at 6:35 AM Nick Telford <nick.telf...@gmail.com>
>> > wrote:
>> > >
>> > >> I don't really have a problem with adding such a metric, I'm just not
>> > >> entirely sure how it would work. If we used "iterator-duration-max",
>> for
>> > >> example, would it not be confusing that it includes Iterators that
>> are
>> > >> still open, and therefore the duration is not yet known? When
>> graphing
>> > that
>> > >> over time, I suspect it would be difficult to understand.
>> > >>
>> > >> 3.
>> > >> FWIW, this would still be picked up by "open-iterators", since that
>> > metric
>> > >> is only decremented when Iterator#close is called (via the
>> > >> ManagedKeyValueIterator#onClose hook).
>> > >>
>> > >> I'm actually considering expanding the scope of this KIP slightly to
>> > >> include improved Block Cache metrics, as my own memory leak
>> > investigations
>> > >> have trended in that direction. Do you think the following metrics
>> > should
>> > >> be included in this KIP, or should I create a new KIP?
>> > >>
>> > >>     - block-cache-index-usage (number of bytes occupied by index
>> blocks)
>> > >>     - block-cache-filter-usage (number of bytes occupied by filter
>> > blocks)
>> > >>
>> > >> Regards,
>> > >> Nick
>> > >>
>> > >> On Tue, 24 Oct 2023 at 07:09, Sophie Blee-Goldman <
>> > sop...@responsive.dev>
>> > >> wrote:
>> > >>
>> > >>> I actually think we could implement Lucas' suggestion pretty easily
>> and
>> > >>> without too much additional effort. We have full control over the
>> > >> iterator
>> > >>> that is returned by the various range queries, so it would be easy
>> to
>> > >>> register a gauge metric for how long it has been since the iterator
>> was
>> > >>> created. Then we just deregister the metric when the iterator is
>> > closed.
>> > >>>
>> > >>> With respect to how useful this metric would be, both Nick and Lucas
>> > have
>> > >>> made good points: I would agree that in general, leaking iterators
>> > would
>> > >>> mean an ever-increasing iterator count that should be possible to
>> spot
>> > >>> without this. However, a few things to consider:
>> > >>>
>> > >>> 1. it's really easy to set up an alert based on some maximum
>> threshold
>> > of
>> > >>> how long an iterator should remain open for. It's relatively more
>> > tricky
>> > >> to
>> > >>> set up alerts based on the current count of open iterators and how
>> it
>> > >>> changes over time.
>> > >>> 2. As Lucas mentioned, it only takes a few iterators to wreak havoc
>> in
>> > >>> extreme cases. Sometimes more advanced applications end up with
>> just a
>> > >> few
>> > >>> leaking iterators despite closing the majority of them. I've seen
>> this
>> > >>> happen just once personally, but it was driving everyone crazy
>> until we
>> > >>> figured it out.
>> > >>> 3. A metric for how long the iterator has been open would help to
>> > >> identify
>> > >>> hanging iterators due to some logic where the iterator is properly
>> > closed
>> > >>> but for whatever reason just isn't being advanced to the end, and
>> thus
>> > >> not
>> > >>> reached the iterator#close line of the user code. This case seems
>> > >> difficult
>> > >>> to spot without the specific metric for iterator lifetime
>> > >>> 4. Lastly, I think you could argue that all of the above are fairly
>> > >>> advanced use cases, but this seems like a fairly advanced feature
>> > >> already,
>> > >>> so it doesn't seem unreasonable to try and cover all the bases.
>> > >>>
>> > >>> All that said, my philosophy is that the KIP author gets the final
>> word
>> > >> on
>> > >>> what to pull into scope as long as the proposal isn't harming anyone
>> > >>> without the extra feature/changes. So it's up to you Nick --  just
>> > wanted
>> > >>> to add some context on how it could work, and why it would be
>> helpful
>> > >>>
>> > >>> Thanks for the KIP!
>> > >>>
>> > >>> On Wed, Oct 18, 2023 at 7:04 AM Lucas Brutschy
>> > >>> <lbruts...@confluent.io.invalid> wrote:
>> > >>>
>> > >>>> Hi Nick,
>> > >>>>
>> > >>>> I did not think in detail about how to implement it, just about
>> what
>> > >>>> metrics would be nice to have. You are right, we'd have to
>> > >>>> register/deregister the iterators during open/close. This would be
>> > >>>> more complicated to implement than the other metrics, but I do not
>> see
>> > >>>> a fundamental problem with it.
>> > >>>>
>> > >>>> As far as I understand, even a low number of leaked iterators can
>> hurt
>> > >>>> RocksDB compaction significantly. So we may even want to detect if
>> the
>> > >>>> iterators are opened by one-time / rare queries against the state
>> > >>>> store.
>> > >>>>
>> > >>>> But, as I said, that would be an addition and not a change of the
>> > >>>> current contents of the KIP, so I'd support the KIP moving forward
>> > >>>> even without this extension.
>> > >>>>
>> > >>>> Cheers, Lucas
>> > >>>>
>> > >>>>
>> > >>>>
>> > >>>> On Tue, Oct 17, 2023 at 3:45 PM Nick Telford <
>> nick.telf...@gmail.com>
>> > >>>> wrote:
>> > >>>>>
>> > >>>>> Hi Lucas,
>> > >>>>>
>> > >>>>> Hmm, I'm not sure how we could reliably identify such leaked
>> > >> Iterators.
>> > >>>> If
>> > >>>>> we tried to include open iterators when calculating
>> > >> iterator-duration,
>> > >>>> we'd
>> > >>>>> need some kind of registry of all the open iterator creation
>> > >>> timestamps,
>> > >>>>> wouldn't we?
>> > >>>>>
>> > >>>>> In general, if you have a leaky Iterator, it should manifest as a
>> > >>>>> persistently climbing "open-iterators" metric, even on a busy
>> node,
>> > >>>> because
>> > >>>>> each time that Iterator is used, it will leak another one. So
>> even in
>> > >>> the
>> > >>>>> presence of many non-leaky Iterators on a busy instance, the
>> metric
>> > >>>> should
>> > >>>>> still consistently climb.
>> > >>>>>
>> > >>>>> Regards,
>> > >>>>>
>> > >>>>> Nick
>> > >>>>>
>> > >>>>> On Mon, 16 Oct 2023 at 14:24, Lucas Brutschy <
>> lbruts...@confluent.io
>> > >>>> .invalid>
>> > >>>>> wrote:
>> > >>>>>
>> > >>>>>> Hi Nick!
>> > >>>>>>
>> > >>>>>> thanks for the KIP! I think this could be quite useful, given the
>> > >>>>>> problems that we had with leaks due to RocksDB resources not
>> being
>> > >>>>>> closed.
>> > >>>>>>
>> > >>>>>> I don't have any pressing issues why we can't accept it like it
>> is,
>> > >>>>>> just one minor point for discussion: would it also make sense to
>> > >> make
>> > >>>>>> it possible to identify a few very long-running / leaked
>> > >> iterators? I
>> > >>>>>> can imagine on a busy node, it would be hard to spot that 1% of
>> the
>> > >>>>>> iterators never close when looking only at closed iterator or the
>> > >>>>>> number of iterators. But it could still be good to identify those
>> > >>>>>> leaks early. One option would be to add `iterator-duration-max`
>> and
>> > >>>>>> take open iterators into account when computing the metric.
>> > >>>>>>
>> > >>>>>> Cheers,
>> > >>>>>> Lucas
>> > >>>>>>
>> > >>>>>>
>> > >>>>>> On Thu, Oct 5, 2023 at 3:50 PM Nick Telford <
>> > >> nick.telf...@gmail.com>
>> > >>>>>> wrote:
>> > >>>>>>>
>> > >>>>>>> Hi Colt,
>> > >>>>>>>
>> > >>>>>>> I kept the details out of the KIP, because KIPs generally
>> > >> document
>> > >>>>>>> high-level design, but the way I'm doing it is like this:
>> > >>>>>>>
>> > >>>>>>>          final ManagedKeyValueIterator<Bytes, byte[]>
>> > >>>>>>> rocksDbPrefixSeekIterator = cf.prefixScan(accessor,
>> prefixBytes);
>> > >>>>>>> -->     final long startedAt = System.nanoTime();
>> > >>>>>>>          openIterators.add(rocksDbPrefixSeekIterator);
>> > >>>>>>>          rocksDbPrefixSeekIterator.onClose(() -> {
>> > >>>>>>> -->
>> > >>>   metricsRecorder.recordIteratorDuration(System.nanoTime()
>> > >>>> -
>> > >>>>>>> startedAt);
>> > >>>>>>>              openIterators.remove(rocksDbPrefixSeekIterator);
>> > >>>>>>>          });
>> > >>>>>>>
>> > >>>>>>> The lines with the arrow are the new code. This pattern is
>> > >> repeated
>> > >>>>>>> throughout RocksDBStore, wherever a new RocksDbIterator is
>> > >> created.
>> > >>>>>>>
>> > >>>>>>> Regards,
>> > >>>>>>> Nick
>> > >>>>>>>
>> > >>>>>>> On Thu, 5 Oct 2023 at 12:32, Colt McNealy <c...@littlehorse.io>
>> > >>>> wrote:
>> > >>>>>>>
>> > >>>>>>>> Thank you for the KIP, Nick!
>> > >>>>>>>>
>> > >>>>>>>> This would be highly useful for many reasons. Much more sane
>> > >> than
>> > >>>>>> checking
>> > >>>>>>>> for leaked iterators by profiling memory usage while running
>> > >>> 100's
>> > >>>> of
>> > >>>>>>>> thousands of range scans via interactive queries (:
>> > >>>>>>>>
>> > >>>>>>>> One small question:
>> > >>>>>>>>
>> > >>>>>>>>> The iterator-duration metrics will be updated whenever an
>> > >>>> Iterator's
>> > >>>>>>>> close() method is called
>> > >>>>>>>>
>> > >>>>>>>> Does the Iterator have its own "createdAt()" or equivalent
>> > >> field,
>> > >>>> or
>> > >>>>>> do we
>> > >>>>>>>> need to keep track of the Iterator's start time upon creation?
>> > >>>>>>>>
>> > >>>>>>>> Cheers,
>> > >>>>>>>> Colt McNealy
>> > >>>>>>>>
>> > >>>>>>>> *Founder, LittleHorse.dev*
>> > >>>>>>>>
>> > >>>>>>>>
>> > >>>>>>>> On Wed, Oct 4, 2023 at 9:07 AM Nick Telford <
>> > >>>> nick.telf...@gmail.com>
>> > >>>>>>>> wrote:
>> > >>>>>>>>
>> > >>>>>>>>> Hi everyone,
>> > >>>>>>>>>
>> > >>>>>>>>> KIP-989 is a small Kafka Streams KIP to add a few new metrics
>> > >>>> around
>> > >>>>>> the
>> > >>>>>>>>> creation and use of RocksDB Iterators, to aid users in
>> > >>>> identifying
>> > >>>>>>>>> "Iterator leaks" that could cause applications to leak native
>> > >>>> memory.
>> > >>>>>>>>>
>> > >>>>>>>>> Let me know what you think!
>> > >>>>>>>>>
>> > >>>>>>>>>
>> > >>>>>>>>>
>> > >>>>>>>>
>> > >>>>>>
>> > >>>>
>> > >>>
>> > >>
>> >
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-989%3A+RocksDB+Iterator+Metrics
>> > >>>>>>>>>
>> > >>>>>>>>> P.S. I'm not too sure about the formatting of the "New
>> > >> Metrics"
>> > >>>>>> table,
>> > >>>>>>>> any
>> > >>>>>>>>> advice there would be appreciated.
>> > >>>>>>>>>
>> > >>>>>>>>> Regards,
>> > >>>>>>>>> Nick
>> > >>>>>>>>>
>> > >>>>>>>>
>> > >>>>>>
>> > >>>>
>> > >>>
>> > >>
>> > >
>> >
>>
>

Re: [DISCUSS] KIP-989: RocksDB Iterator Metrics

Reply via email to