Quick addendum: My suggested metric "oldest-open-iterator-age-seconds" should be "oldest-open-iterator-age-ms". Milliseconds is obviously a better granularity for such a metric.
Still accepting suggestions for a better name. On Thu, 28 Mar 2024 at 13:41, Nick Telford <nick.telf...@gmail.com> wrote: > Hi everyone, > > Sorry for leaving this for so long. So much for "3 weeks until KIP freeze"! > > On Sophie's comments: > 1. Would Matthias's suggestion of a separate metric tracking the age of > the oldest open iterator (within the tag set) satisfy this? That way we can > keep iterator-duration-(avg|max) for closed iterators, which can be useful > for performance debugging for iterators that don't leak. I'm not sure what > we'd call this metric, maybe: "oldest-open-iterator-age-seconds"? Seems > like a mouthful. > > 2. You're right, it makes more sense to provide > iterator-duration-(avg|max). Honestly, I can't remember why I had "total" > before, or why I was computing a rate-of-change over it. > > 3, 4, 5, 6. Agreed, I'll make all those changes as suggested. > > 7. Combined with Matthias's point about RocksDB, I'm convinced that this > is the wrong KIP for these. I'll introduce the additional Rocks metrics in > another KIP. > > On Matthias's comments: > A. Not sure about the time window. I'm pretty sure all existing avg/max > metrics are since the application was started? Any other suggestions here > would be appreciated. > > B. Agreed. See point 1 above. > > C. Good point. My focus was very much on Rocks memory leaks when I wrote > the first draft. I can generalise it. My only concern is that it might make > it more difficult to detect Rocks iterator leaks caused *within* our > high-level iterator, e.g. RocksJNI, RocksDB, RocksDBStore, etc. But we > could always provide a RocksDB-specific metric for this, as you suggested. > > D. Hmm, we do already have MeteredKeyValueIterator, which automatically > wraps the iterator from inner-stores of MeteredKeyValueStore. If we > implemented these metrics there, then custom stores would automatically > gain the functionality, right? This seems like a pretty logical place to > implement these metrics, since MeteredKeyValueStore is all about adding > metrics to state stores. > > > I imagine the best way to implement this would be to do so at the > > high-level iterator rather than implementing it separately for each > > specific iterator implementation for every store type. > > Sophie, does MeteredKeyValueIterator fit with your recommendation? > > Thanks for your thoughts everyone, I'll update the KIP now. > > Nick > > On Thu, 14 Mar 2024 at 03:37, Sophie Blee-Goldman <sop...@responsive.dev> > wrote: > >> About your last two points: I completely agree that we should try to >> make this independent of RocksDB, and should probably adopt a >> general philosophy of being store-implementation agnostic unless >> there is good reason to focus on a particular store type: eg if it was >> only possible to implement for certain stores, or only made sense in >> the context of a certain store type but not necessarily stores in general. >> >> While leaking memory due to unclosed iterators on RocksDB stores is >> certainly the most common issue, I think Matthias sufficiently >> demonstrated that the problem of leaking iterators is not actually >> unique to RocksDB, and we should consider including in-memory >> stores at the very least. I also think that at this point, we may as well >> just implement the metrics for *all* store types, whether rocksdb or >> in-memory or custom. Not just because it probably applies to all >> store types (leaking iterators are rarely a good thing!) but because >> I imagine the best way to implement this would be to do so at the >> high-level iterator rather than implementing it separately for each >> specific iterator implementation for every store type. >> >> That said, I haven't thought all that carefully about the implementation >> yet -- it just strikes me as easiest to do at the top level of the store >> hierarchy rather than at the bottom. My gut instinct may very well be >> wrong, but that's what it's saying >> >> On Thu, Mar 7, 2024 at 10:43 AM Matthias J. Sax <mj...@apache.org> wrote: >> >> > Seems I am late to this party. Can we pick this up again aiming for 3.8 >> > release? I think it would be a great addition. Few comments: >> > >> > >> > - I think it does make sense to report `iterator-duration-avg` and >> > `iterator-duration-max` for all *closed* iterators -- it just seems to >> > be a useful metric (wondering if this would be _overall_ or bounded to >> > some time window?) >> > >> > - About the duration iterators are currently open, I believe the only >> > useful way is to report the "oldest iterator", ie, the smallest iterator >> > open-time, of all currently open-iterator? We all agree that in general, >> > leaking iterator would bump the count metric, and if there is a few ones >> > which are not closed and open for a long time, it seem sufficient to >> > detect the single oldest one for alerting purpose? >> > >> > - What I don't like about the KIP is it focus on RocksDB. I don't think >> > we should build on the internal RocksDB counters as proposed (I guess, >> > we could still expose them, similar to other RocksDB metrics which we >> > expose already). However, for this new metric, we should track it >> > ourselves and thus make it independent of RocksDB -- in the end, an >> > in-memory store could also leak memory (and kill a JVM with an >> > out-of-memory error) and we should be able to track it. >> > >> > - Not sure if we would like to add support for custom stores, to allow >> > them to register their iterators with this metric? Or would this not be >> > necessary, because custom stores could just register a custom metric >> > about it to begin with? >> > >> > >> > >> > -Matthias >> > >> > On 10/25/23 4:41 PM, Sophie Blee-Goldman wrote: >> > >> >> > >> If we used "iterator-duration-max", for >> > >> example, would it not be confusing that it includes Iterators that >> are >> > >> still open, and therefore the duration is not yet known? >> > > >> > > >> > > 1. Ah, I think I understand your concern better now -- I totally agree >> > that >> > > a >> > > "iterator-duration-max" metric would be confusing/misleading. I was >> > > thinking about it a bit differently, something more akin to the >> > > "last-rebalance-seconds-ago" consumer metric. As the name suggests, >> > > that basically just tracks how long the consumer has gone without >> > > rebalancing -- it doesn't purport to represent the actual duration >> > between >> > > rebalances, just the current time since the last one. The hard part >> is >> > > really >> > > in choosing a name that reflects this -- maybe you have some better >> ideas >> > > but off the top of my head, perhaps something like >> > "iterator-lifetime-max"? >> > > >> > > 2. I'm not quite sure how to interpret the "iterator-duration-total" >> > metric >> > > -- what exactly does it mean to add up all the iterator durations? For >> > > some context, while this is not a hard-and-fast rule, in general >> you'll >> > > find that Kafka/Streams metrics tend to come in pairs of avg/max or >> > > rate/total. Something that you might measure the avg for usually is >> > > also useful to measure the max, whereas a total metric is probably >> > > also useful as a rate but not so much as an avg. I actually think this >> > > is part of why it feels like it makes so much sense to include a "max" >> > > version of this metric, as Lucas suggested, even if the name of >> > > "iterator-duration-max" feels misleading. Ultimately the metric names >> > > are up to you, but for this reason, I would personally advocate for >> > > just going with an "iterator-duration-avg" and "iterator-duration-max" >> > > >> > > I did see your example in which you mention one could monitor the >> > > rate of change of the "-total" metric. While this does make sense to >> > > me, if the only way to interpret a metric is by computing another >> > > function over it, then why not just make that computation the metric >> > > and cut out the middle man? And in this case, to me at least, it feels >> > > much easier to understand a metric like "iterator-duration-max" vs >> > > something like "iterator-duration-total-rate" >> > > >> > > 3. By the way, can you add another column to the table with the new >> > metrics >> > > that lists the recording level? My suggestion would be to put the >> > > "number-open-iterators" at INFO and the other two at DEBUG. See >> > > the following for my reasoning behind this recommendation >> > > >> > > 4. I would change the "Type" entry for the "number-open-iterators" >> from >> > > "Value" to "Gauge". This helps justify the "INFO" level for this >> metric, >> > > since unlike the other metrics which are "Measurables", the current >> > > timestamp won't need to be retrieved on each recording >> > > >> > > 5. Can you list the tags that would be associated with each of these >> > > metrics (either in the table, or separately above/below if they will >> be >> > > the same for all) >> > > >> > > 6. Do you have a strong preference for the name >> "number-open-iterators" >> > > or would you be alright in shortening this to "num-open-iterators"? >> The >> > > latter is more in line with the naming scheme used elsewhere in Kafka >> > > for similar kinds of metrics, and a shorter name is always nice. >> > > >> > > 7. With respect to the rocksdb cache metrics, those sound useful but >> > > if it was me, I would probably save them for a separate KIP mainly >> just >> > > because the KIP freeze deadline is in a few weeks, and I wouldn't want >> > > to end up blocking all the new metrics just because there was ongoing >> > > debate about a subset of them. That said, you do have 3 full weeks, so >> > > I would hope that you could get both sets of metrics agreed upon in >> > > that timeframe! >> > > >> > > >> > > On Tue, Oct 24, 2023 at 6:35 AM Nick Telford <nick.telf...@gmail.com> >> > wrote: >> > > >> > >> I don't really have a problem with adding such a metric, I'm just not >> > >> entirely sure how it would work. If we used "iterator-duration-max", >> for >> > >> example, would it not be confusing that it includes Iterators that >> are >> > >> still open, and therefore the duration is not yet known? When >> graphing >> > that >> > >> over time, I suspect it would be difficult to understand. >> > >> >> > >> 3. >> > >> FWIW, this would still be picked up by "open-iterators", since that >> > metric >> > >> is only decremented when Iterator#close is called (via the >> > >> ManagedKeyValueIterator#onClose hook). >> > >> >> > >> I'm actually considering expanding the scope of this KIP slightly to >> > >> include improved Block Cache metrics, as my own memory leak >> > investigations >> > >> have trended in that direction. Do you think the following metrics >> > should >> > >> be included in this KIP, or should I create a new KIP? >> > >> >> > >> - block-cache-index-usage (number of bytes occupied by index >> blocks) >> > >> - block-cache-filter-usage (number of bytes occupied by filter >> > blocks) >> > >> >> > >> Regards, >> > >> Nick >> > >> >> > >> On Tue, 24 Oct 2023 at 07:09, Sophie Blee-Goldman < >> > sop...@responsive.dev> >> > >> wrote: >> > >> >> > >>> I actually think we could implement Lucas' suggestion pretty easily >> and >> > >>> without too much additional effort. We have full control over the >> > >> iterator >> > >>> that is returned by the various range queries, so it would be easy >> to >> > >>> register a gauge metric for how long it has been since the iterator >> was >> > >>> created. Then we just deregister the metric when the iterator is >> > closed. >> > >>> >> > >>> With respect to how useful this metric would be, both Nick and Lucas >> > have >> > >>> made good points: I would agree that in general, leaking iterators >> > would >> > >>> mean an ever-increasing iterator count that should be possible to >> spot >> > >>> without this. However, a few things to consider: >> > >>> >> > >>> 1. it's really easy to set up an alert based on some maximum >> threshold >> > of >> > >>> how long an iterator should remain open for. It's relatively more >> > tricky >> > >> to >> > >>> set up alerts based on the current count of open iterators and how >> it >> > >>> changes over time. >> > >>> 2. As Lucas mentioned, it only takes a few iterators to wreak havoc >> in >> > >>> extreme cases. Sometimes more advanced applications end up with >> just a >> > >> few >> > >>> leaking iterators despite closing the majority of them. I've seen >> this >> > >>> happen just once personally, but it was driving everyone crazy >> until we >> > >>> figured it out. >> > >>> 3. A metric for how long the iterator has been open would help to >> > >> identify >> > >>> hanging iterators due to some logic where the iterator is properly >> > closed >> > >>> but for whatever reason just isn't being advanced to the end, and >> thus >> > >> not >> > >>> reached the iterator#close line of the user code. This case seems >> > >> difficult >> > >>> to spot without the specific metric for iterator lifetime >> > >>> 4. Lastly, I think you could argue that all of the above are fairly >> > >>> advanced use cases, but this seems like a fairly advanced feature >> > >> already, >> > >>> so it doesn't seem unreasonable to try and cover all the bases. >> > >>> >> > >>> All that said, my philosophy is that the KIP author gets the final >> word >> > >> on >> > >>> what to pull into scope as long as the proposal isn't harming anyone >> > >>> without the extra feature/changes. So it's up to you Nick -- just >> > wanted >> > >>> to add some context on how it could work, and why it would be >> helpful >> > >>> >> > >>> Thanks for the KIP! >> > >>> >> > >>> On Wed, Oct 18, 2023 at 7:04 AM Lucas Brutschy >> > >>> <lbruts...@confluent.io.invalid> wrote: >> > >>> >> > >>>> Hi Nick, >> > >>>> >> > >>>> I did not think in detail about how to implement it, just about >> what >> > >>>> metrics would be nice to have. You are right, we'd have to >> > >>>> register/deregister the iterators during open/close. This would be >> > >>>> more complicated to implement than the other metrics, but I do not >> see >> > >>>> a fundamental problem with it. >> > >>>> >> > >>>> As far as I understand, even a low number of leaked iterators can >> hurt >> > >>>> RocksDB compaction significantly. So we may even want to detect if >> the >> > >>>> iterators are opened by one-time / rare queries against the state >> > >>>> store. >> > >>>> >> > >>>> But, as I said, that would be an addition and not a change of the >> > >>>> current contents of the KIP, so I'd support the KIP moving forward >> > >>>> even without this extension. >> > >>>> >> > >>>> Cheers, Lucas >> > >>>> >> > >>>> >> > >>>> >> > >>>> On Tue, Oct 17, 2023 at 3:45 PM Nick Telford < >> nick.telf...@gmail.com> >> > >>>> wrote: >> > >>>>> >> > >>>>> Hi Lucas, >> > >>>>> >> > >>>>> Hmm, I'm not sure how we could reliably identify such leaked >> > >> Iterators. >> > >>>> If >> > >>>>> we tried to include open iterators when calculating >> > >> iterator-duration, >> > >>>> we'd >> > >>>>> need some kind of registry of all the open iterator creation >> > >>> timestamps, >> > >>>>> wouldn't we? >> > >>>>> >> > >>>>> In general, if you have a leaky Iterator, it should manifest as a >> > >>>>> persistently climbing "open-iterators" metric, even on a busy >> node, >> > >>>> because >> > >>>>> each time that Iterator is used, it will leak another one. So >> even in >> > >>> the >> > >>>>> presence of many non-leaky Iterators on a busy instance, the >> metric >> > >>>> should >> > >>>>> still consistently climb. >> > >>>>> >> > >>>>> Regards, >> > >>>>> >> > >>>>> Nick >> > >>>>> >> > >>>>> On Mon, 16 Oct 2023 at 14:24, Lucas Brutschy < >> lbruts...@confluent.io >> > >>>> .invalid> >> > >>>>> wrote: >> > >>>>> >> > >>>>>> Hi Nick! >> > >>>>>> >> > >>>>>> thanks for the KIP! I think this could be quite useful, given the >> > >>>>>> problems that we had with leaks due to RocksDB resources not >> being >> > >>>>>> closed. >> > >>>>>> >> > >>>>>> I don't have any pressing issues why we can't accept it like it >> is, >> > >>>>>> just one minor point for discussion: would it also make sense to >> > >> make >> > >>>>>> it possible to identify a few very long-running / leaked >> > >> iterators? I >> > >>>>>> can imagine on a busy node, it would be hard to spot that 1% of >> the >> > >>>>>> iterators never close when looking only at closed iterator or the >> > >>>>>> number of iterators. But it could still be good to identify those >> > >>>>>> leaks early. One option would be to add `iterator-duration-max` >> and >> > >>>>>> take open iterators into account when computing the metric. >> > >>>>>> >> > >>>>>> Cheers, >> > >>>>>> Lucas >> > >>>>>> >> > >>>>>> >> > >>>>>> On Thu, Oct 5, 2023 at 3:50 PM Nick Telford < >> > >> nick.telf...@gmail.com> >> > >>>>>> wrote: >> > >>>>>>> >> > >>>>>>> Hi Colt, >> > >>>>>>> >> > >>>>>>> I kept the details out of the KIP, because KIPs generally >> > >> document >> > >>>>>>> high-level design, but the way I'm doing it is like this: >> > >>>>>>> >> > >>>>>>> final ManagedKeyValueIterator<Bytes, byte[]> >> > >>>>>>> rocksDbPrefixSeekIterator = cf.prefixScan(accessor, >> prefixBytes); >> > >>>>>>> --> final long startedAt = System.nanoTime(); >> > >>>>>>> openIterators.add(rocksDbPrefixSeekIterator); >> > >>>>>>> rocksDbPrefixSeekIterator.onClose(() -> { >> > >>>>>>> --> >> > >>> metricsRecorder.recordIteratorDuration(System.nanoTime() >> > >>>> - >> > >>>>>>> startedAt); >> > >>>>>>> openIterators.remove(rocksDbPrefixSeekIterator); >> > >>>>>>> }); >> > >>>>>>> >> > >>>>>>> The lines with the arrow are the new code. This pattern is >> > >> repeated >> > >>>>>>> throughout RocksDBStore, wherever a new RocksDbIterator is >> > >> created. >> > >>>>>>> >> > >>>>>>> Regards, >> > >>>>>>> Nick >> > >>>>>>> >> > >>>>>>> On Thu, 5 Oct 2023 at 12:32, Colt McNealy <c...@littlehorse.io> >> > >>>> wrote: >> > >>>>>>> >> > >>>>>>>> Thank you for the KIP, Nick! >> > >>>>>>>> >> > >>>>>>>> This would be highly useful for many reasons. Much more sane >> > >> than >> > >>>>>> checking >> > >>>>>>>> for leaked iterators by profiling memory usage while running >> > >>> 100's >> > >>>> of >> > >>>>>>>> thousands of range scans via interactive queries (: >> > >>>>>>>> >> > >>>>>>>> One small question: >> > >>>>>>>> >> > >>>>>>>>> The iterator-duration metrics will be updated whenever an >> > >>>> Iterator's >> > >>>>>>>> close() method is called >> > >>>>>>>> >> > >>>>>>>> Does the Iterator have its own "createdAt()" or equivalent >> > >> field, >> > >>>> or >> > >>>>>> do we >> > >>>>>>>> need to keep track of the Iterator's start time upon creation? >> > >>>>>>>> >> > >>>>>>>> Cheers, >> > >>>>>>>> Colt McNealy >> > >>>>>>>> >> > >>>>>>>> *Founder, LittleHorse.dev* >> > >>>>>>>> >> > >>>>>>>> >> > >>>>>>>> On Wed, Oct 4, 2023 at 9:07 AM Nick Telford < >> > >>>> nick.telf...@gmail.com> >> > >>>>>>>> wrote: >> > >>>>>>>> >> > >>>>>>>>> Hi everyone, >> > >>>>>>>>> >> > >>>>>>>>> KIP-989 is a small Kafka Streams KIP to add a few new metrics >> > >>>> around >> > >>>>>> the >> > >>>>>>>>> creation and use of RocksDB Iterators, to aid users in >> > >>>> identifying >> > >>>>>>>>> "Iterator leaks" that could cause applications to leak native >> > >>>> memory. >> > >>>>>>>>> >> > >>>>>>>>> Let me know what you think! >> > >>>>>>>>> >> > >>>>>>>>> >> > >>>>>>>>> >> > >>>>>>>> >> > >>>>>> >> > >>>> >> > >>> >> > >> >> > >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-989%3A+RocksDB+Iterator+Metrics >> > >>>>>>>>> >> > >>>>>>>>> P.S. I'm not too sure about the formatting of the "New >> > >> Metrics" >> > >>>>>> table, >> > >>>>>>>> any >> > >>>>>>>>> advice there would be appreciated. >> > >>>>>>>>> >> > >>>>>>>>> Regards, >> > >>>>>>>>> Nick >> > >>>>>>>>> >> > >>>>>>>> >> > >>>>>> >> > >>>> >> > >>> >> > >> >> > > >> > >> >