The time window thing was just an idea. Happy to drop it.

For the oldest iterator metric, I would propose something simple like `iterator-opened-ms` and it would just be the actual timestamp when the iterator was opened. I don't think we need to compute the actual age; users can do this computation themselves?

If we think reporting the age instead of just the timestamp is better, I would propose `iterator-max-age-ms`. It should be sufficient to call out (as it's kinda "obvious" anyway) that the metric applies to open iterators only.
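
To make that concrete, here is a rough sketch of what I have in mind -- plain Java, and all class/method names are just placeholders, not a proposal for the implementation:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Sketch only: registry of open iterators and their open timestamps.
    // All names are placeholders, not the final KIP names.
    public class OpenIteratorTracker {
        private final Map<Object, Long> openIterators = new ConcurrentHashMap<>();

        public void onOpen(final Object iterator) {
            openIterators.put(iterator, System.currentTimeMillis());
        }

        public void onClose(final Object iterator) {
            openIterators.remove(iterator);
        }

        // "iterator-opened-ms" variant: open timestamp of the oldest open iterator, or -1 if none
        public long oldestOpenedMs() {
            return openIterators.values().stream().mapToLong(Long::longValue).min().orElse(-1L);
        }

        // "iterator-max-age-ms" variant: age of the oldest open iterator, or 0 if none
        public long maxAgeMs(final long nowMs) {
            final long oldest = oldestOpenedMs();
            return oldest < 0 ? 0L : nowMs - oldest;
        }
    }

Either value could be exposed as a store-level gauge; the age variant just needs the current time at recording time.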

And yes, I was hoping that the code inside MeteredXxxStore might already be set up in a way that custom stores would inherit the iterator metrics automatically -- I am just not sure, and left it as an exercise for somebody to confirm :)


Nit: the KIP says it's a store-level metric, but I think it would be good to say explicitly that it's recorded with DEBUG level only?



-Matthias


On 3/28/24 2:52 PM, Nick Telford wrote:
Quick addendum:

My suggested metric "oldest-open-iterator-age-seconds" should be
"oldest-open-iterator-age-ms". Milliseconds is obviously a better
granularity for such a metric.

Still accepting suggestions for a better name.

On Thu, 28 Mar 2024 at 13:41, Nick Telford <nick.telf...@gmail.com> wrote:

Hi everyone,

Sorry for leaving this for so long. So much for "3 weeks until KIP freeze"!

On Sophie's comments:
1. Would Matthias's suggestion of a separate metric tracking the age of
the oldest open iterator (within the tag set) satisfy this? That way we can
keep iterator-duration-(avg|max) for closed iterators, which can be useful
for performance debugging for iterators that don't leak. I'm not sure what
we'd call this metric, maybe: "oldest-open-iterator-age-seconds"? Seems
like a mouthful.

2. You're right, it makes more sense to provide
iterator-duration-(avg|max). Honestly, I can't remember why I had "total"
before, or why I was computing a rate-of-change over it.

3, 4, 5, 6. Agreed, I'll make all those changes as suggested.

7. Combined with Matthias's point about RocksDB, I'm convinced that this
is the wrong KIP for these. I'll introduce the additional Rocks metrics in
another KIP.

On Matthias's comments:
A. Not sure about the time window. I'm pretty sure all existing avg/max
metrics are since the application was started? Any other suggestions here
would be appreciated.

B. Agreed. See point 1 above.

C. Good point. My focus was very much on Rocks memory leaks when I wrote
the first draft. I can generalise it. My only concern is that it might make
it more difficult to detect Rocks iterator leaks caused *within* our
high-level iterator, e.g. RocksJNI, RocksDB, RocksDBStore, etc. But we
could always provide a RocksDB-specific metric for this, as you suggested.

D. Hmm, we do already have MeteredKeyValueIterator, which automatically
wraps the iterator from inner-stores of MeteredKeyValueStore. If we
implemented these metrics there, then custom stores would automatically
gain the functionality, right? This seems like a pretty logical place to
implement these metrics, since MeteredKeyValueStore is all about adding
metrics to state stores.
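
To illustrate, here is a minimal sketch of the kind of wrapping iterator I mean -- the class name and the duration callback are made up for the example, they're not the actual MeteredKeyValueIterator internals:

    import org.apache.kafka.streams.KeyValue;
    import org.apache.kafka.streams.state.KeyValueIterator;

    import java.util.function.LongConsumer;

    // Sketch only: wrap the inner store's iterator and record how long it was
    // open once close() is called. Names are illustrative.
    public class DurationRecordingIterator<K, V> implements KeyValueIterator<K, V> {
        private final KeyValueIterator<K, V> inner;
        private final LongConsumer durationRecorder;   // e.g. a sensor's record method
        private final long startedAtNanos = System.nanoTime();

        public DurationRecordingIterator(final KeyValueIterator<K, V> inner,
                                         final LongConsumer durationRecorder) {
            this.inner = inner;
            this.durationRecorder = durationRecorder;
        }

        @Override
        public boolean hasNext() {
            return inner.hasNext();
        }

        @Override
        public KeyValue<K, V> next() {
            return inner.next();
        }

        @Override
        public K peekNextKey() {
            return inner.peekNextKey();
        }

        @Override
        public void close() {
            try {
                inner.close();
            } finally {
                durationRecorder.accept(System.nanoTime() - startedAtNanos);
            }
        }
    }

Since MeteredKeyValueStore already wraps whatever iterator the inner store returns, custom stores should pick this up without any changes on their side.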

I imagine the best way to implement this would be to do so at the
high-level iterator rather than implementing it separately for each
specific iterator implementation for every store type.

Sophie, does MeteredKeyValueIterator fit with your recommendation?

Thanks for your thoughts everyone, I'll update the KIP now.

Nick

On Thu, 14 Mar 2024 at 03:37, Sophie Blee-Goldman <sop...@responsive.dev>
wrote:

About your last two points: I completely agree that we should try to
make this independent of RocksDB, and should probably adopt a
general philosophy of being store-implementation agnostic unless
there is good reason to focus on a particular store type: eg if it was
only possible to implement for certain stores, or only made sense in
the context of a certain store type but not necessarily stores in general.

While leaking memory due to unclosed iterators on RocksDB stores is
certainly the most common issue, I think Matthias sufficiently
demonstrated that the problem of leaking iterators is not actually
unique to RocksDB, and we should consider including in-memory
stores at the very least. I also think that at this point, we may as well
just implement the metrics for *all* store types, whether rocksdb or
in-memory or custom. Not just because it probably applies to all
store types (leaking iterators are rarely a good thing!) but because
I imagine the best way to implement this would be to do so at the
high-level iterator rather than implementing it separately for each
specific iterator implementation for every store type.

That said, I haven't thought all that carefully about the implementation
yet -- it just strikes me as easiest to do at the top level of the store
hierarchy rather than at the bottom. My gut instinct may very well be
wrong, but that's what it's saying

On Thu, Mar 7, 2024 at 10:43 AM Matthias J. Sax <mj...@apache.org> wrote:

Seems I am late to this party. Can we pick this up again aiming for 3.8
release? I think it would be a great addition. Few comments:


- I think it does make sense to report `iterator-duration-avg` and
`iterator-duration-max` for all *closed* iterators -- it just seems to
be a useful metric (wondering if this would be _overall_ or bounded to
some time window?)

- About the duration iterators are currently open, I believe the only useful way is to report the "oldest iterator", i.e., the smallest open-time of all currently open iterators? We all agree that, in general, a leaking iterator would bump the count metric, and if there are a few that are not closed and stay open for a long time, it seems sufficient to detect the single oldest one for alerting purposes?

- What I don't like about the KIP is its focus on RocksDB. I don't think
we should build on the internal RocksDB counters as proposed (I guess,
we could still expose them, similar to other RocksDB metrics which we
expose already). However, for this new metric, we should track it
ourselves and thus make it independent of RocksDB -- in the end, an
in-memory store could also leak memory (and kill a JVM with an
out-of-memory error) and we should be able to track it.

- Not sure if we would like to add support for custom stores, to allow
them to register their iterators with this metric? Or would this not be
necessary, because custom stores could just register a custom metric
about it to begin with?



-Matthias

On 10/25/23 4:41 PM, Sophie Blee-Goldman wrote:

   If we used "iterator-duration-max", for
example, would it not be confusing that it includes Iterators that
are
still open, and therefore the duration is not yet known?


1. Ah, I think I understand your concern better now -- I totally agree that an "iterator-duration-max" metric would be confusing/misleading. I was thinking about it a bit differently, something more akin to the "last-rebalance-seconds-ago" consumer metric. As the name suggests, that basically just tracks how long the consumer has gone without rebalancing -- it doesn't purport to represent the actual duration between rebalances, just the current time since the last one. The hard part is really in choosing a name that reflects this -- maybe you have some better ideas, but off the top of my head, perhaps something like "iterator-lifetime-max"?

2. I'm not quite sure how to interpret the "iterator-duration-total" metric -- what exactly does it mean to add up all the iterator durations? For some context, while this is not a hard-and-fast rule, in general you'll find that Kafka/Streams metrics tend to come in pairs of avg/max or rate/total. Something that you might measure the avg for is usually also useful to measure the max, whereas a total metric is probably also useful as a rate but not so much as an avg. I actually think this is part of why it feels like it makes so much sense to include a "max" version of this metric, as Lucas suggested, even if the name "iterator-duration-max" feels misleading. Ultimately the metric names are up to you, but for this reason, I would personally advocate for just going with "iterator-duration-avg" and "iterator-duration-max".

I did see your example in which you mention one could monitor the
rate of change of the "-total" metric. While this does make sense to
me, if the only way to interpret a metric is by computing another
function over it, then why not just make that computation the metric
and cut out the middle man? And in this case, to me at least, it feels
much easier to understand a metric like "iterator-duration-max" vs
something like "iterator-duration-total-rate"

3. By the way, can you add another column to the table with the new metrics that lists the recording level? My suggestion would be to put the "number-open-iterators" at INFO and the other two at DEBUG. See the following for my reasoning behind this recommendation.

4. I would change the "Type" entry for the "number-open-iterators" from "Value" to "Gauge". This helps justify the "INFO" level for this metric, since unlike the other metrics, which are "Measurables", the current timestamp won't need to be retrieved on each recording.

5. Can you list the tags that would be associated with each of these metrics (either in the table, or separately above/below if they will be the same for all)?

6. Do you have a strong preference for the name "number-open-iterators", or would you be alright in shortening this to "num-open-iterators"? The latter is more in line with the naming scheme used elsewhere in Kafka for similar kinds of metrics, and a shorter name is always nice.

7. With respect to the rocksdb cache metrics, those sound useful, but if it was me, I would probably save them for a separate KIP, mainly just because the KIP freeze deadline is in a few weeks, and I wouldn't want to end up blocking all the new metrics just because there was ongoing debate about a subset of them. That said, you do have 3 full weeks, so I would hope that you could get both sets of metrics agreed upon in that timeframe!


On Tue, Oct 24, 2023 at 6:35 AM Nick Telford <nick.telf...@gmail.com>
wrote:

I don't really have a problem with adding such a metric, I'm just not entirely sure how it would work. If we used "iterator-duration-max", for example, would it not be confusing that it includes Iterators that are still open, and therefore the duration is not yet known? When graphing that over time, I suspect it would be difficult to understand.

3. FWIW, this would still be picked up by "open-iterators", since that metric is only decremented when Iterator#close is called (via the ManagedKeyValueIterator#onClose hook).

I'm actually considering expanding the scope of this KIP slightly to include improved Block Cache metrics, as my own memory leak investigations have trended in that direction. Do you think the following metrics should be included in this KIP, or should I create a new KIP?

     - block-cache-index-usage (number of bytes occupied by index blocks)
     - block-cache-filter-usage (number of bytes occupied by filter blocks)

Regards,
Nick

On Tue, 24 Oct 2023 at 07:09, Sophie Blee-Goldman <sop...@responsive.dev> wrote:

I actually think we could implement Lucas' suggestion pretty easily and without too much additional effort. We have full control over the iterator that is returned by the various range queries, so it would be easy to register a gauge metric for how long it has been since the iterator was created. Then we just deregister the metric when the iterator is closed.
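
For example, a rough sketch of the register/deregister idea using the common Metrics/Gauge classes -- the metric and group names here are placeholders, and the real wiring into the Streams metrics layer would look different:

    import org.apache.kafka.common.MetricName;
    import org.apache.kafka.common.metrics.Gauge;
    import org.apache.kafka.common.metrics.Metrics;

    // Sketch only: register a per-iterator age gauge on open, remove it on close.
    // Metric/group names are placeholders.
    public class IteratorAgeGauge implements AutoCloseable {
        private final Metrics metrics;
        private final MetricName metricName;

        public IteratorAgeGauge(final Metrics metrics, final String storeName, final long iteratorId) {
            this.metrics = metrics;
            this.metricName = metrics.metricName(
                "open-iterator-age-ms-" + iteratorId,      // hypothetical name
                "stream-state-metrics",                    // hypothetical group
                "Time since this iterator was opened, for store " + storeName);
            final long openedAtMs = System.currentTimeMillis();
            metrics.addMetric(metricName, (Gauge<Long>) (config, now) -> now - openedAtMs);
        }

        @Override
        public void close() {
            metrics.removeMetric(metricName);
        }
    }

The iterator's close/onClose hook would then just call close() on this object to deregister the gauge.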

With respect to how useful this metric would be, both Nick and Lucas have made good points: I would agree that in general, leaking iterators would mean an ever-increasing iterator count that should be possible to spot without this. However, a few things to consider:

1. it's really easy to set up an alert based on some maximum threshold of how long an iterator should remain open for. It's relatively more tricky to set up alerts based on the current count of open iterators and how it changes over time.
2. As Lucas mentioned, it only takes a few iterators to wreak havoc in extreme cases. Sometimes more advanced applications end up with just a few leaking iterators despite closing the majority of them. I've seen this happen just once personally, but it was driving everyone crazy until we figured it out.
3. A metric for how long the iterator has been open would help to identify hanging iterators due to some logic where the iterator is properly closed but, for whatever reason, just isn't being advanced to the end, and thus the iterator#close line of the user code is never reached. This case seems difficult to spot without the specific metric for iterator lifetime.
4. Lastly, I think you could argue that all of the above are fairly advanced use cases, but this seems like a fairly advanced feature already, so it doesn't seem unreasonable to try and cover all the bases.

All that said, my philosophy is that the KIP author gets the final word on what to pull into scope, as long as the proposal isn't harming anyone without the extra feature/changes. So it's up to you Nick -- just wanted to add some context on how it could work, and why it would be helpful.

Thanks for the KIP!

On Wed, Oct 18, 2023 at 7:04 AM Lucas Brutschy
<lbruts...@confluent.io.invalid> wrote:

Hi Nick,

I did not think in detail about how to implement it, just about what metrics would be nice to have. You are right, we'd have to register/deregister the iterators during open/close. This would be more complicated to implement than the other metrics, but I do not see a fundamental problem with it.

As far as I understand, even a low number of leaked iterators can hurt RocksDB compaction significantly. So we may even want to detect if the iterators are opened by one-time / rare queries against the state store.

But, as I said, that would be an addition and not a change of the
current contents of the KIP, so I'd support the KIP moving forward
even without this extension.

Cheers, Lucas



On Tue, Oct 17, 2023 at 3:45 PM Nick Telford <nick.telf...@gmail.com> wrote:

Hi Lucas,

Hmm, I'm not sure how we could reliably identify such leaked Iterators. If we tried to include open iterators when calculating iterator-duration, we'd need some kind of registry of all the open iterator creation timestamps, wouldn't we?

In general, if you have a leaky Iterator, it should manifest as a persistently climbing "open-iterators" metric, even on a busy node, because each time that Iterator is used, it will leak another one. So even in the presence of many non-leaky Iterators on a busy instance, the metric should still consistently climb.

Regards,

Nick

On Mon, 16 Oct 2023 at 14:24, Lucas Brutschy <lbruts...@confluent.io.invalid> wrote:

Hi Nick!

thanks for the KIP! I think this could be quite useful, given the problems that we had with leaks due to RocksDB resources not being closed.

I don't have any pressing issues why we can't accept it like it is, just one minor point for discussion: would it also make sense to make it possible to identify a few very long-running / leaked iterators? I can imagine on a busy node, it would be hard to spot that 1% of the iterators never close when looking only at closed iterators or the number of iterators. But it could still be good to identify those leaks early. One option would be to add `iterator-duration-max` and take open iterators into account when computing the metric.

Cheers,
Lucas


On Thu, Oct 5, 2023 at 3:50 PM Nick Telford <nick.telf...@gmail.com> wrote:

Hi Colt,

I kept the details out of the KIP, because KIPs generally document high-level design, but the way I'm doing it is like this:

          final ManagedKeyValueIterator<Bytes, byte[]> rocksDbPrefixSeekIterator =
              cf.prefixScan(accessor, prefixBytes);
-->       final long startedAt = System.nanoTime();
          openIterators.add(rocksDbPrefixSeekIterator);
          rocksDbPrefixSeekIterator.onClose(() -> {
-->           metricsRecorder.recordIteratorDuration(System.nanoTime() - startedAt);
              openIterators.remove(rocksDbPrefixSeekIterator);
          });

The lines with the arrow are the new code. This pattern is repeated throughout RocksDBStore, wherever a new RocksDbIterator is created.
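
For completeness, recordIteratorDuration would then just feed an avg/max sensor, roughly like this sketch -- names are placeholders, and this isn't the actual implementation:

    import org.apache.kafka.common.metrics.Metrics;
    import org.apache.kafka.common.metrics.Sensor;
    import org.apache.kafka.common.metrics.stats.Avg;
    import org.apache.kafka.common.metrics.stats.Max;

    // Sketch only: back the proposed iterator-duration-avg/max metrics with a Sensor.
    public class IteratorDurationRecorder {
        private final Sensor sensor;

        public IteratorDurationRecorder(final Metrics metrics) {
            sensor = metrics.sensor("iterator-duration");
            sensor.add(metrics.metricName("iterator-duration-avg", "stream-state-metrics",
                "Average time an iterator was open before being closed"), new Avg());
            sensor.add(metrics.metricName("iterator-duration-max", "stream-state-metrics",
                "Maximum time an iterator was open before being closed"), new Max());
        }

        // called from the onClose hook with System.nanoTime() - startedAt
        public void recordIteratorDuration(final long durationNanos) {
            sensor.record(durationNanos);
        }
    }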

Regards,
Nick

On Thu, 5 Oct 2023 at 12:32, Colt McNealy <c...@littlehorse.io>
wrote:

Thank you for the KIP, Nick!

This would be highly useful for many reasons. Much more sane than checking for leaked iterators by profiling memory usage while running 100's of thousands of range scans via interactive queries (:

One small question:

The iterator-duration metrics will be updated whenever an Iterator's close() method is called

Does the Iterator have its own "createdAt()" or equivalent field, or do we need to keep track of the Iterator's start time upon creation?

Cheers,
Colt McNealy

*Founder, LittleHorse.dev*


On Wed, Oct 4, 2023 at 9:07 AM Nick Telford <nick.telf...@gmail.com> wrote:

Hi everyone,

KIP-989 is a small Kafka Streams KIP to add a few new metrics around the creation and use of RocksDB Iterators, to aid users in identifying "Iterator leaks" that could cause applications to leak native memory.

Let me know what you think!

https://cwiki.apache.org/confluence/display/KAFKA/KIP-989%3A+RocksDB+Iterator+Metrics

P.S. I'm not too sure about the formatting of the "New Metrics" table, any advice there would be appreciated.

Regards,
Nick
