Re: [DISCUSS] KIP-989: RocksDB Iterator Metrics

Matthias J. Sax Thu, 16 May 2024 11:31:04 -0700

SGTM to use nano-seconds across the board.

On 5/16/24 7:12 AM, Nick Telford wrote:

Actually, one other point: our existing state store operation metrics are
measured in nanoseconds[1].


Should iterator-duration-(avg|max) also be measured in nanoseconds, for
consistency, or should we keep them milliseconds, as the KIP currently
states?

1:
https://docs.confluent.io/platform/current/streams/monitoring.html#state-store-metrics

On Thu, 16 May 2024 at 12:15, Nick Telford <nick.telf...@gmail.com> wrote:

Good point! I've updated it to "Improved StateStore Iterator metrics for
detecting leaks" - let me know if you have a better suggestion.

This should affect the voting imo, as nothing of substance has changed.

Regards,
Nick

On Thu, 16 May 2024 at 01:39, Sophie Blee-Goldman <sop...@responsive.dev>
wrote:

One quick thing -- can you update the title of this KIP to reflect the
decision to implement these metrics for all state stores implementations
rather than just RocksDB?


On Tue, May 14, 2024 at 1:36 PM Nick Telford <nick.telf...@gmail.com>
wrote:

Woops! Thanks for the catch Lucas. Given this was just a typo, I don't
think this affects the voting.

Cheers,
Nick

On Tue, 14 May 2024 at 18:06, Lucas Brutschy <lbruts...@confluent.io
.invalid>
wrote:

Hi Nick,

you are still referring to oldest-open-iterator-age-ms in the
`Proposed Changes` section.

Cheers,
Lucas

On Thu, May 2, 2024 at 4:00 PM Lucas Brutschy <lbruts...@confluent.io

wrote:


Hi Nick!

I agree, the age variant is a bit nicer since the semantics are very
clear from the name. If you'd rather go for the simple

implementation,

how about calling it `oldest-iterator-open-since-ms`? I believe this
could be understood without docs. Either way, I think we should be
able to open the vote for this KIP because nobody raised any major /
blocking concerns.

Looking forward to getting this voted on soon!

Cheers
Lucas

On Sun, Mar 31, 2024 at 5:23 PM Nick Telford <

nick.telf...@gmail.com>

wrote:


Hi Matthias,

For the oldest iterator metric, I would propose something simple

like

`iterator-opened-ms` and it would just be the actual timestamp

when

the

iterator was opened. I don't think we need to compute the actual

age,

but user can to this computation themselves?


That works for me; it's easier to implement like that :-D I'm a

little

concerned that the name "iterator-opened-ms" may not be obvious

enough

without reading the docs.

If we think reporting the age instead of just the timestamp is

better, I

would propose `iterator-max-age-ms`. I should be sufficient to

call

out

(as it's kinda "obvious" anyway) that the metric applies to open
iterator only.


While I think it's preferable to record the timestamp, rather than

the

age,

this does have the benefit of a more obvious metric name.

Nit: the KIP says it's a store-level metric, but I think it

would

be

good to say explicitly that it's recorded with DEBUG level only?


Yes, I've already updated the KIP with this information in the

table.


Regards,

Nick

On Sun, 31 Mar 2024 at 10:53, Matthias J. Sax <mj...@apache.org>

wrote:

The time window thing was just an idea. Happy to drop it.

For the oldest iterator metric, I would propose something simple

like

`iterator-opened-ms` and it would just be the actual timestamp

when

the

iterator was opened. I don't think we need to compute the actual

age,

but user can to this computation themselves?

If we think reporting the age instead of just the timestamp is

better, I

would propose `iterator-max-age-ms`. I should be sufficient to

call

out

(as it's kinda "obvious" anyway) that the metric applies to open
iterator only.

And yes, I was hoping that the code inside MetereXxxStore might

already

be setup in a way that custom stores would inherit the iterator

metrics

automatically -- I am just not sure, and left it as an exercise

for

somebody to confirm :)


Nit: the KIP says it's a store-level metric, but I think it

would

be

good to say explicitly that it's recorded with DEBUG level only?



-Matthias


On 3/28/24 2:52 PM, Nick Telford wrote:

Quick addendum:

My suggested metric "oldest-open-iterator-age-seconds" should

be

"oldest-open-iterator-age-ms". Milliseconds is obviously a

better

granularity for such a metric.

Still accepting suggestions for a better name.

On Thu, 28 Mar 2024 at 13:41, Nick Telford <

nick.telf...@gmail.com

wrote:

Hi everyone,

Sorry for leaving this for so long. So much for "3 weeks

until

KIP

freeze"!


On Sophie's comments:
1. Would Matthias's suggestion of a separate metric tracking

the

age of

the oldest open iterator (within the tag set) satisfy this?

That

way we

can

keep iterator-duration-(avg|max) for closed iterators, which

can

be

useful

for performance debugging for iterators that don't leak. I'm

not

sure

what

we'd call this metric, maybe:

"oldest-open-iterator-age-seconds"?

Seems

like a mouthful.

2. You're right, it makes more sense to provide
iterator-duration-(avg|max). Honestly, I can't remember why I

had

"total"

before, or why I was computing a rate-of-change over it.

3, 4, 5, 6. Agreed, I'll make all those changes as suggested.

7. Combined with Matthias's point about RocksDB, I'm

convinced

that this

is the wrong KIP for these. I'll introduce the additional

Rocks

metrics

in

another KIP.

On Matthias's comments:
A. Not sure about the time window. I'm pretty sure all

existing

avg/max

metrics are since the application was started? Any other

suggestions

here

would be appreciated.

B. Agreed. See point 1 above.

C. Good point. My focus was very much on Rocks memory leaks

when

I wrote

the first draft. I can generalise it. My only concern is

that it

might

make

it more difficult to detect Rocks iterator leaks caused

*within*

our

high-level iterator, e.g. RocksJNI, RocksDB, RocksDBStore,

etc.

But we

could always provide a RocksDB-specific metric for this, as

you

suggested.


D. Hmm, we do already have MeteredKeyValueIterator, which

automatically

wraps the iterator from inner-stores of

MeteredKeyValueStore. If

we

implemented these metrics there, then custom stores would

automatically

gain the functionality, right? This seems like a pretty

logical

place to

implement these metrics, since MeteredKeyValueStore is all

about

adding

metrics to state stores.

I imagine the best way to implement this would be to do so

at

the

high-level iterator rather than implementing it separately

for

each

specific iterator implementation for every store type.


Sophie, does MeteredKeyValueIterator fit with your

recommendation?


Thanks for your thoughts everyone, I'll update the KIP now.

Nick

On Thu, 14 Mar 2024 at 03:37, Sophie Blee-Goldman <

sop...@responsive.dev>

wrote:

About your last two points: I completely agree that we

should

try to

make this independent of RocksDB, and should probably adopt

general philosophy of being store-implementation agnostic

unless

there is good reason to focus on a particular store type:

eg if

it was

only possible to implement for certain stores, or only made

sense in

the context of a certain store type but not necessarily

stores

in

general.


While leaking memory due to unclosed iterators on RocksDB

stores

is

certainly the most common issue, I think Matthias

sufficiently

demonstrated that the problem of leaking iterators is not

actually

unique to RocksDB, and we should consider including

in-memory

stores at the very least. I also think that at this point,

we

may as

well

just implement the metrics for *all* store types, whether

rocksdb or

in-memory or custom. Not just because it probably applies to

all

store types (leaking iterators are rarely a good thing!) but

because

I imagine the best way to implement this would be to do so

at

the

high-level iterator rather than implementing it separately

for

each

specific iterator implementation for every store type.

That said, I haven't thought all that carefully about the

implementation

yet -- it just strikes me as easiest to do at the top level

of

the

store

hierarchy rather than at the bottom. My gut instinct may

very

well be

wrong, but that's what it's saying

On Thu, Mar 7, 2024 at 10:43 AM Matthias J. Sax <

mj...@apache.org>

wrote:

Seems I am late to this party. Can we pick this up again

aiming

for

3.8

release? I think it would be a great addition. Few

comments:



- I think it does make sense to report

`iterator-duration-avg`

and

`iterator-duration-max` for all *closed* iterators -- it

just

seems to

be a useful metric (wondering if this would be _overall_ or

bounded to

some time window?)

- About the duration iterators are currently open, I

believe

the only

useful way is to report the "oldest iterator", ie, the

smallest

iterator

open-time, of all currently open-iterator? We all agree

that

in

general,

leaking iterator would bump the count metric, and if there

is

few

ones

which are not closed and open for a long time, it seem

sufficient to

detect the single oldest one for alerting purpose?

- What I don't like about the KIP is it focus on RocksDB. I

don't

think

we should build on the internal RocksDB counters as

proposed

(I

guess,

we could still expose them, similar to other RocksDB

metrics

which we

expose already). However, for this new metric, we should

track

it

ourselves and thus make it independent of RocksDB -- in the

end, an

in-memory store could also leak memory (and kill a JVM

with an

out-of-memory error) and we should be able to track it.

- Not sure if we would like to add support for custom

stores,

to allow

them to register their iterators with this metric? Or would

this not

be

necessary, because custom stores could just register a

custom

metric

about it to begin with?



-Matthias

On 10/25/23 4:41 PM, Sophie Blee-Goldman wrote:


    If we used "iterator-duration-max", for
example, would it not be confusing that it includes

Iterators

that

are

still open, and therefore the duration is not yet known?



1. Ah, I think I understand your concern better now -- I

totally

agree

that

a
    "iterator-duration-max" metric would be

confusing/misleading. I

was

thinking about it a bit differently, something more akin

to

the

"last-rebalance-seconds-ago" consumer metric. As the name

suggests,

that basically just tracks how long the consumer has gone

without

rebalancing -- it doesn't purport to represent the actual

duration

between

rebalances, just the current time since the last one.  The

hard part

is

really
in choosing a name that reflects this -- maybe you have

some

better

ideas

but off the top of my head, perhaps something like

"iterator-lifetime-max"?


2. I'm not quite sure how to interpret the

"iterator-duration-total"

metric

-- what exactly does it mean to add up all the iterator

durations?

For

some context, while this is not a hard-and-fast rule, in

general

you'll

find that Kafka/Streams metrics tend to come in pairs of

avg/max or

rate/total. Something that you might measure the avg for

usually is

also useful to measure the max, whereas a total metric is

probably

also useful as a rate but not so much as an avg. I

actually

think

this

is part of why it feels like it makes so much sense to

include

"max"

version of this metric, as Lucas suggested, even if the

name

of

"iterator-duration-max" feels misleading. Ultimately the

metric names

are up to you, but for this reason, I would personally

advocate for

just going with an "iterator-duration-avg" and

"iterator-duration-max"


I did see your example in which you mention one could

monitor

the

rate of change of the "-total" metric. While this does

make

sense to

me, if the only way to interpret a metric is by computing

another

function over it, then why not just make that computation

the

metric

and cut out the middle man? And in this case, to me at

least,

it

feels

much easier to understand a metric like

"iterator-duration-max" vs

something like "iterator-duration-total-rate"

3. By the way, can you add another column to the table

with

the new

metrics

that lists the recording level? My suggestion would be to

put

the

"number-open-iterators" at INFO and the other two at

DEBUG.

See

the following for my reasoning behind this recommendation

4. I would change the "Type" entry for the

"number-open-iterators"

from

"Value" to "Gauge". This helps justify the "INFO" level

for

this

metric,

since unlike the other metrics which are "Measurables",

the

current

timestamp won't need to be retrieved on each recording

5. Can you list the tags that would be associated with

each

of

these

metrics (either in the table, or separately above/below if

they will

be

the same for all)

6. Do you have a strong preference for the name

"number-open-iterators"

or would you be alright in shortening this to

"num-open-iterators"?

The

latter is more in line with the naming scheme used

elsewhere

in Kafka

for similar kinds of metrics, and a shorter name is always

nice.


7. With respect to the rocksdb cache metrics, those sound

useful but

if it was me, I would probably save them for a separate

KIP

mainly

just

because the KIP freeze deadline is in a few weeks, and I

wouldn't

want

to end up blocking all the new metrics just because there

was

ongoing

debate about a subset of them. That said, you do have 3

full

weeks,

so

I would hope that you could get both sets of metrics

agreed

upon in

that timeframe!


On Tue, Oct 24, 2023 at 6:35 AM Nick Telford <

nick.telf...@gmail.com

wrote:

I don't really have a problem with adding such a metric,

I'm

just

not

entirely sure how it would work. If we used

"iterator-duration-max",

for

example, would it not be confusing that it includes

Iterators

that

are

still open, and therefore the duration is not yet known?

When

graphing

that

over time, I suspect it would be difficult to understand.

3.
FWIW, this would still be picked up by "open-iterators",

since that

metric

is only decremented when Iterator#close is called (via

the

ManagedKeyValueIterator#onClose hook).

I'm actually considering expanding the scope of this KIP

slightly to

include improved Block Cache metrics, as my own memory

leak

investigations

have trended in that direction. Do you think the

following

metrics

should

be included in this KIP, or should I create a new KIP?

      - block-cache-index-usage (number of bytes occupied

by

index

blocks)

      - block-cache-filter-usage (number of bytes

occupied by

filter

blocks)


Regards,
Nick

On Tue, 24 Oct 2023 at 07:09, Sophie Blee-Goldman <

sop...@responsive.dev>

wrote:

I actually think we could implement Lucas' suggestion

pretty

easily

and

without too much additional effort. We have full control

over the

iterator

that is returned by the various range queries, so it

would

be easy

to

register a gauge metric for how long it has been since

the

iterator

was

created. Then we just deregister the metric when the

iterator is

closed.


With respect to how useful this metric would be, both

Nick

and

Lucas

have

made good points: I would agree that in general, leaking

iterators

would

mean an ever-increasing iterator count that should be

possible to

spot

without this. However, a few things to consider:

1. it's really easy to set up an alert based on some

maximum

threshold

of

how long an iterator should remain open for. It's

relatively

more

tricky

to

set up alerts based on the current count of open

iterators

and how

it

changes over time.
2. As Lucas mentioned, it only takes a few iterators to

wreak havoc

in

extreme cases. Sometimes more advanced applications end

up

with

just a

few

leaking iterators despite closing the majority of them.

I've

seen

this

happen just once personally, but it was driving everyone

crazy

until we

figured it out.
3. A metric for how long the iterator has been open

would

help to

identify

hanging iterators due to some logic where the iterator

is

properly

closed

but for whatever reason just isn't being advanced to the

end, and

thus

not

reached the iterator#close line of the user code. This

case

seems

difficult

to spot without the specific metric for iterator

lifetime

4. Lastly, I think you could argue that all of the above

are

fairly

advanced use cases, but this seems like a fairly

advanced

feature

already,

so it doesn't seem unreasonable to try and cover all the

bases.


All that said, my philosophy is that the KIP author gets

the

final

word

on

what to pull into scope as long as the proposal isn't

harming

anyone

without the extra feature/changes. So it's up to you

Nick

--  just

wanted

to add some context on how it could work, and why it

would

be

helpful


Thanks for the KIP!

On Wed, Oct 18, 2023 at 7:04 AM Lucas Brutschy
<lbruts...@confluent.io.invalid> wrote:

Hi Nick,

I did not think in detail about how to implement it,

just

about

what

metrics would be nice to have. You are right, we'd

have to

register/deregister the iterators during open/close.

This

would be

more complicated to implement than the other metrics,

but

do not

see

a fundamental problem with it.

As far as I understand, even a low number of leaked

iterators can

hurt

RocksDB compaction significantly. So we may even want

to

detect if

the

iterators are opened by one-time / rare queries against

the

state

store.

But, as I said, that would be an addition and not a

change

of the

current contents of the KIP, so I'd support the KIP

moving

forward

even without this extension.

Cheers, Lucas



On Tue, Oct 17, 2023 at 3:45 PM Nick Telford <

nick.telf...@gmail.com>

wrote:


Hi Lucas,

Hmm, I'm not sure how we could reliably identify such

leaked

Iterators.

If

we tried to include open iterators when calculating

iterator-duration,

we'd

need some kind of registry of all the open iterator

creation

timestamps,

wouldn't we?

In general, if you have a leaky Iterator, it should

manifest as a

persistently climbing "open-iterators" metric, even

on a

busy

node,

because

each time that Iterator is used, it will leak another

one.

So

even in

the

presence of many non-leaky Iterators on a busy

instance,

the

metric

should

still consistently climb.

Regards,

Nick

On Mon, 16 Oct 2023 at 14:24, Lucas Brutschy <

lbruts...@confluent.io

.invalid>

wrote:

Hi Nick!

thanks for the KIP! I think this could be quite

useful,

given

the

problems that we had with leaks due to RocksDB

resources

not

being

closed.

I don't have any pressing issues why we can't accept

it

like it

is,

just one minor point for discussion: would it also

make

sense to

make

it possible to identify a few very long-running /

leaked

iterators? I

can imagine on a busy node, it would be hard to spot

that

1% of

the

iterators never close when looking only at closed

iterator or

the

number of iterators. But it could still be good to

identify

those

leaks early. One option would be to add

`iterator-duration-max`

and

take open iterators into account when computing the

metric.


Cheers,
Lucas


On Thu, Oct 5, 2023 at 3:50 PM Nick Telford <

nick.telf...@gmail.com>

wrote:


Hi Colt,

I kept the details out of the KIP, because KIPs

generally

document

high-level design, but the way I'm doing it is like

this:


           final ManagedKeyValueIterator<Bytes,

byte[]>

rocksDbPrefixSeekIterator = cf.prefixScan(accessor,

prefixBytes);

-->     final long startedAt = System.nanoTime();

  openIterators.add(rocksDbPrefixSeekIterator);

           rocksDbPrefixSeekIterator.onClose(() -> {
-->

metricsRecorder.recordIteratorDuration(System.nanoTime()

startedAt);

  openIterators.remove(rocksDbPrefixSeekIterator);

           });

The lines with the arrow are the new code. This

pattern

is

repeated

throughout RocksDBStore, wherever a new

RocksDbIterator

is

created.


Regards,
Nick

On Thu, 5 Oct 2023 at 12:32, Colt McNealy <

c...@littlehorse.io

wrote:

Thank you for the KIP, Nick!

This would be highly useful for many reasons. Much

more

sane

than

checking

for leaked iterators by profiling memory usage

while

running

100's

of

thousands of range scans via interactive queries (:

One small question:

The iterator-duration metrics will be updated

whenever

an

Iterator's

close() method is called

Does the Iterator have its own "createdAt()" or

equivalent

field,

or

do we

need to keep track of the Iterator's start time

upon

creation?


Cheers,
Colt McNealy

*Founder, LittleHorse.dev*


On Wed, Oct 4, 2023 at 9:07 AM Nick Telford <

nick.telf...@gmail.com>

wrote:

Hi everyone,

KIP-989 is a small Kafka Streams KIP to add a few

new

metrics

around

the

creation and use of RocksDB Iterators, to aid

users

in

identifying

"Iterator leaks" that could cause applications to

leak

native

memory.


Let me know what you think!

https://cwiki.apache.org/confluence/display/KAFKA/KIP-989%3A+RocksDB+Iterator+Metrics


P.S. I'm not too sure about the formatting of the

"New

Metrics"

table,

any

advice there would be appreciated.

Regards,
Nick

Re: [DISCUSS] KIP-989: RocksDB Iterator Metrics

Reply via email to