[jira] [Commented] (KAFKA-7224) KIP-328: Add spill-to-disk for Suppression
[ https://issues.apache.org/jira/browse/KAFKA-7224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17702980#comment-17702980 ] Matthias J. Sax commented on KAFKA-7224: With [https://cwiki.apache.org/confluence/display/KAFKA/KIP-825%3A+introduce+a+new+API+to+control+when+aggregated+results+are+produced] added to 3.3, do we still want/need this one?

> KIP-328: Add spill-to-disk for Suppression
> --
>
> Key: KAFKA-7224
> URL: https://issues.apache.org/jira/browse/KAFKA-7224
> Project: Kafka
> Issue Type: Improvement
> Components: streams
> Reporter: John Roesler
> Priority: Major
>
> As described in
> [https://cwiki.apache.org/confluence/display/KAFKA/KIP-328%3A+Ability+to+suppress+updates+for+KTables]
> Following on KAFKA-7223, implement the spill-to-disk buffering strategy.

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[ https://issues.apache.org/jira/browse/KAFKA-7224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17105278#comment-17105278 ] Maatari commented on KAFKA-7224: Got it. Many thanks.
[ https://issues.apache.org/jira/browse/KAFKA-7224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17103420#comment-17103420 ] John Roesler commented on KAFKA-7224: Hi Maatari, Yeah, it probably seems beside the point because it is beside the point. I probably shouldn't have mentioned it. I guess I was just thinking that when the general problem is "I get too many updates in the output", some of those are idempotent, while others are non-idempotent. If we eliminate the idempotent updates, then maybe the number of updates on the output side drops below the "too many" threshold and the problem goes away. Of course, if you want a guarantee, such as a rate limit or that you don't emit _any_ result until some specific time, then of course you need something with those semantics, which is orthogonal to whether there are idempotent updates or not.
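The idempotent-update point John makes can be sketched as a tiny emit-on-change filter in the spirit of KIP-557. This is an illustrative stdlib-only sketch, not the actual Streams implementation; the class and method names are made up:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Sketch of emit-on-change: remember the last forwarded value per key
// and drop any update that would repeat it (an idempotent update).
public class EmitOnChange {
    private final Map<String, String> last = new HashMap<>();

    // Returns true if the update changes the value for the key and
    // should therefore be forwarded downstream.
    public boolean shouldForward(String key, String value) {
        String prev = last.put(key, value);
        return !Objects.equals(prev, value);
    }
}
```

As John notes, this drops idempotent updates early but gives no rate-limit or "no result before time T" guarantee; those need suppression semantics.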
[ https://issues.apache.org/jira/browse/KAFKA-7224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17102667#comment-17102667 ] Maatari commented on KAFKA-7224: [~vvcephei] thank you so much for your input on this. I understood everything well except the last point, related to emit-on-change. I do not see why
{code:java}
ktable0.join(ktable1.groupby.reduce){code}
can cause idempotent updates. I read KIP-557 and the example with the PageViewID/SessionID, and I can see the idempotent update there, but not with the code above. Would you please elaborate a bit on this?
[ https://issues.apache.org/jira/browse/KAFKA-7224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17097684#comment-17097684 ] John Roesler commented on KAFKA-7224: Hi all, Thanks for the good points all around. Just to close the loop on _this_ ticket (disk-based suppression): the performance was _extremely_ poor. So much so that my thinking was that for anyone with high enough volume to actually need suppression, it would be too slow to be useful. The problem is that we need to check the beginning of the suppression buffer on every (or almost every) record, to see if we need to emit something. For an in-memory store, this is fine, but for RocksDB in particular, scan performance is very slow. There are fundamental reasons why this is the case, which we don't need to get into here. It might be possible to cleverly engineer our way around the problem, but anything I came up with just sounded too complicated to be worth it. However, this is only necessary if you want the semantics of Suppress (each record times out individually, based on stream time). If you instead just want to buffer everything on disk and then emit everything you've buffered, say once an hour, you can do it much more efficiently in a custom FlatTransformValues where you put all incoming data into the store, then schedule a wall-clock punctuation to scan the entire store and forward everything. The one complication is that the wall-clock punctuation currently blocks the StreamThread, so you need to have some sense of how long it will take (observed empirically) and make sure that you set {{max.poll.interval.ms}} with enough headroom so you won't drop out of the group. This is bleeding more into the domain of KIP-424, which does seem more like what [~maatdeamon] needs (just agreeing with the discussion so far).
I don't think there was any technical impediment to implementing that one; it was just that the KIP discussion petered out (which happens sometimes). I guess, building on my last paragraph, _if_ we had wall-clock-based suppression, _then_ it might make more sense to offer on-disk suppression in addition to in-memory, as at least the (wall-clock + on-disk) configuration could be performant. But it would need much more design. I'm still unsure whether on-disk suppression is really a good idea to implement in the DSL. A final thought worth mentioning in this discussion is that KIP-557 ( [https://cwiki.apache.org/confluence/display/KAFKA/KIP-557%3A+Add+emit+on+change+support+for+Kafka+Streams] ) will go a long way toward dropping unnecessary updates. This isn't the same thing as suppressing intermediate results, but it will help a great deal to at least drop idempotent updates early in the topology and not even have to suppress them at the end. Thanks, -John
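The buffer-then-flush alternative John describes (put every incoming record into a store, then have a periodic wall-clock punctuation scan the whole store and forward everything) can be modeled with plain collections. In a real topology the map would be a state store and `flush()` would run inside a `PunctuationType.WALL_CLOCK_TIME` punctuation; this stdlib sketch and all its names are illustrative only:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Model of the buffer-then-flush pattern: each incoming update
// overwrites the latest value for its key; nothing is forwarded until
// the periodic flush drains the whole store.
public class BufferAndFlush {
    private final Map<String, String> store = new LinkedHashMap<>();

    // Called once per incoming record.
    public void put(String key, String value) {
        store.put(key, value);
    }

    // Called by the periodic (wall-clock) trigger: scan all buffered
    // entries, forward them downstream, and clear the store.
    public List<Map.Entry<String, String>> flush() {
        List<Map.Entry<String, String>> forwarded = new ArrayList<>();
        for (Map.Entry<String, String> e : store.entrySet()) {
            forwarded.add(Map.entry(e.getKey(), e.getValue()));
        }
        store.clear();
        return forwarded;
    }
}
```

Note how this sidesteps the per-record head-scan that makes RocksDB-backed Suppress slow: the store is only scanned once per flush interval, not once per record.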
[ https://issues.apache.org/jira/browse/KAFKA-7224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17097043#comment-17097043 ] Maatari commented on KAFKA-7224: Yes, will look at how to go about this.
[ https://issues.apache.org/jira/browse/KAFKA-7224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17097040#comment-17097040 ] Matthias J. Sax commented on KAFKA-7224: Thanks for the clarification. I missed the point that you can allow `suppress()` to also emit if the buffer is full. For this case, having a larger buffer could help to reduce intermediate results. My bad. About KIP-424: I am not sure atm what was proposed, but I agree that only a pure wall-clock-time emit strategy makes sense for good rate control. Btw: I think [~vvcephei] actually worked on RocksDB support for suppress(), but the performance was pretty bad and he never finished it. In any case: if you really need a feature and nobody is working on it, feel free to pick it up. Kafka is an open source project after all.
[ https://issues.apache.org/jira/browse/KAFKA-7224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17097041#comment-17097041 ] Maatari commented on KAFKA-7224: Finally, I don't think providing this would only support my use case. I think solving it goes in the direction of this statement:
{quote}Close the gap between the semantics of KTables in streams and tables in relational databases. It is common practice to capture changes as they are made to tables in a RDBMS into Kafka topics (JDBC-connect, Debezium, Maxwell). These entities typically have multiple one-to-many relationship. Usually RDBMSs offer good support to resolve this relationship with a join. Streams falls short here and the workaround (group by - join - lateral view) is not well supported as well and is not in line with the idea of record based processing.
{quote}
found here: [https://cwiki.apache.org/confluence/display/KAFKA/KIP-213+Support+non-key+joining+in+KTable]
[ https://issues.apache.org/jira/browse/KAFKA-7224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17097032#comment-17097032 ] Maatari commented on KAFKA-7224:
{quote}*However, if there was a way to enforce a maximum time a records stay in the buffer without being emitted,*
{quote}
{quote}Well, the current suppress does this. Or do you refer to wall-clock time?
{quote}
I think there is a bit of confusion here as well. What I mean is exactly the last point I referred to in my last message. So to clarify: if by *wall-clock-time* emit strategy you mean not being event driven, as the author suggested, but driven by the wall clock only, then yes, I do refer to wall-clock time when I say this.
{quote}I cannot follow here. If you buffer and suppress updates to the same key and emit update in a certain "frequency" there is no difference if you do this in-memory of if you spill to disk. The only difference is, how many unique keys the suppress buffer can handle: for in-memory the number of unique keys is smaller as all the data must fit into main-memory, while RocksDB would allow to process more unique keys. But the number of unique keys is independent to the number of intermediate result (that you need to count _per key_ as updates to two different keys would never suppress each other).
{quote}
You are spot on my point when you mention that RocksDB would allow processing (suppressing) more unique keys. Beyond that, my thinking was: the more unique keys I can hold, the more suppression I can do without evicting things. However, I do not understand your last statement:
{quote}But the number of unique keys is independent to the number of intermediate result (that you need to count _per key_ as updates to two different keys would never suppress each other).
{quote}
Do you not think that the bigger the suppression buffer, whether in memory or on disk, the more suppression you can do?
So far, if I understood you well, it sounds like a combination of KIP-328 + KIP-424 (wall-clock-time emit strategy) would solve my use case, no? How to get there is another question, but at least making sure to go in the right direction is important. I like the approach of keeping the semantics of the stream separate from the operational concerns: [https://www.confluent.io/blog/kafka-streams-take-on-watermarks-and-triggers/]
[ https://issues.apache.org/jira/browse/KAFKA-7224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17097025#comment-17097025 ] Maatari commented on KAFKA-7224: Thank you so much for your clarification, it helps a lot. I will try to clarify some of my confusing statements.
{quote}What do you mean by "at the end of the topology"? There is nothing like this. Note that the input is not a "finite table" but the "infinite table changelog stream".
{quote}
I just meant having something like this:
{code:java}
ktable0.join(ktable1.groupby.reduce).suppress(...){code}
It is my language here that was misleading. I agree with you, it is not a finite table. What I want is to significantly mitigate the intermediary results.
{quote}That is by design. Because the input may contain out-of-order data, time cannot easily be advanced if the input stream "stalls". Otherwise, the whole operation becomes non-deterministic (what might be ok for your use case though). This would require some wall-clock time emit strategy though (as you mentioned already, ie, KIP-424).{quote}
As you suggest above, that is exactly what would put me in the right direction, given my use case. I will specifically adopt your language: *wall-clock time emit strategy.* Is that really what was intended in KIP-424? On that page, [https://cwiki.apache.org/confluence/display/KAFKA/KIP-424%3A+Allow+suppression+of+intermediate+events+based+on+wall+clock+time], the author specifically says: _"However, checks against the wall clock are event driven: if no events are received in over 5 seconds, no events will be sent downstream"_ Hence, just to clarify, do you mean the same thing when you say wall-clock time emit strategy? Because if that is the case, the same problem as above remains for me: some records can still stay stuck if nothing else comes in. This is important, because I wanted to ask, from your point of view, whether it is even feasible to have wall-clock time used the way I mean.
That is, if a record has stayed in the buffer past the configured time, emit it anyway, even if no new record has been ingested.
[ https://issues.apache.org/jira/browse/KAFKA-7224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17096976#comment-17096976 ] Matthias J. Sax commented on KAFKA-7224: It seems we use the term "intermediate" result in the same way. However, note that for a KTable-KTable join there is no "final" result: the result is by definition an infinite changelog stream; for each update to the input tables, a new result update record is produced. Hence, the only thing you can do is to say: don't give me every update, but (for the same key) only a subset of updates.
{quote}cause if i want to suppress all the intermediary result let say at the end of the topology above
{quote}
What do you mean by "at the end of the topology"? There is nothing like this. Note that the input is not a "finite table" but the "infinite table changelog stream".
{quote}given the frequency with which the database is updated, i can find myself with records, stuck in the supression buffer. Indeed it is stream time
{quote}
That is by design. Because the input may contain out-of-order data, time cannot easily be advanced if the input stream "stalls". Otherwise, the whole operation becomes non-deterministic (which might be ok for your use case though). This would require some wall-clock-time emit strategy though (as you mentioned already, ie, KIP-424).
{quote}However, if there was a way to enforce a maximum time a records stay in the buffer without being emitted,
{quote}
Well, the current suppress does this. Or do you refer to wall-clock time?
{quote}and if that buffer was rocksDB, then i think i could massively mitigate those intermediary result, and produce despite the frequency of the db i am ready the data from.
{quote}
I cannot follow here. If you buffer and suppress updates to the same key and emit updates at a certain "frequency", there is no difference between doing this in-memory and spilling to disk.
The only difference is how many unique keys the suppress buffer can handle: for in-memory, the number of unique keys is smaller, as all the data must fit into main memory, while RocksDB would allow processing more unique keys. But the number of unique keys is independent of the number of intermediate results (which you need to count _per key_, as updates to two different keys would never suppress each other).
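Matthias's distinction between buffer capacity (how many distinct keys are in flight) and suppression effectiveness (how many updates get collapsed per key) can be illustrated with a minimal per-key buffer sketch. This is hypothetical stdlib code, not the actual suppress() implementation:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// A suppression buffer holds at most one pending value per key, so its
// size bounds the number of distinct keys in flight, not the number of
// intermediate updates per key that get collapsed.
public class PerKeyBuffer {
    private final Map<String, String> pending = new LinkedHashMap<>();
    private int collapsed = 0;

    public void update(String key, String value) {
        if (pending.put(key, value) != null) {
            collapsed++; // an older intermediate result for this key was replaced
        }
    }

    // Memory/disk footprint is driven by this...
    public int distinctKeys() { return pending.size(); }

    // ...while suppression effectiveness is driven by this, per key.
    public int collapsedUpdates() { return collapsed; }
}
```

Three updates to one key occupy a single buffer slot however many times the key is updated, which is why a bigger (e.g. RocksDB-backed) buffer helps with key cardinality but not with the per-key update rate.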
[ https://issues.apache.org/jira/browse/KAFKA-7224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17096868#comment-17096868 ] Maatari commented on KAFKA-7224: I have played a bit with suppress using untilTimeLimit, but with no success, because if I want to suppress the intermediary results, say at the end of the topology above, then given the frequency with which the database is updated, I can find myself with records stuck in the suppression buffer. Indeed, it is stream time; if it does not progress, a record might never be emitted. Besides, I would need quite a large time window to get effective suppression. My understanding now is that untilTimeLimit is event driven, which I did not know when I first posted my message. However, if there was a way to enforce a maximum time a record stays in the buffer without being emitted, and if that buffer was RocksDB, then I think I could massively mitigate those intermediary results and produce output despite the update frequency of the database I am reading the data from.
[ https://issues.apache.org/jira/browse/KAFKA-7224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17096863#comment-17096863 ] Maatari commented on KAFKA-7224: What I call an intermediate result is in the following context. Say you have the following topology:
{code:java}
ktable0.join(ktable1.groupby.reduce){code}
where the reduce just acts like collect_list in KSQL. This is a use case we have and need. There is a repartition topic at the groupby, and therefore the same record is emitted multiple times, while the list collected by the reduce keeps growing until the entire topic is consumed. This in turn generates multiple results for the join as well, as the same key on the right side of the join arrives multiple times. So you end up with systematic, ever-growing versions of records. That is what I call intermediate results. This is a way to build views on normalized data that represent an entity together with references to all its outgoing links. We used to do that in our databases, but it did not scale.
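The effect Maatari describes can be modeled in a few lines: a list-collecting reduce emits one new, larger, intermediate list per input record, and each of those emissions would re-trigger the downstream join. Purely illustrative sketch; class and method names are made up:

```java
import java.util.ArrayList;
import java.util.List;

// Models a groupBy + reduce that collects a list: every input record
// for a key produces a fresh, larger snapshot of the accumulated list,
// i.e. one intermediate result per update.
public class CollectListEmissions {
    public static List<List<String>> emissionsFor(List<String> inputs) {
        List<String> acc = new ArrayList<>();
        List<List<String>> emitted = new ArrayList<>();
        for (String in : inputs) {
            acc.add(in);
            emitted.add(new ArrayList<>(acc)); // snapshot emitted downstream
        }
        return emitted;
    }
}
```

Three inputs yield three emissions (of sizes 1, 2, 3), and a downstream join against this side would itself fire three times for the same key; only the last emission reflects the fully consumed input.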
[ https://issues.apache.org/jira/browse/KAFKA-7224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17096237#comment-17096237 ] Matthias J. Sax commented on KAFKA-7224: This ticket would not reduce intermediate results. Not sure what issue you are facing with "too many intermediate" results. Are you using `suppress()` already? If yes, what issue do you face? Also, wall-clock-time suppression does not help to reduce intermediate results; rather, it makes suppression non-deterministic. It might be helpful for some use cases, ie, output rate control. But I am not sure how it relates to your use case?
[ https://issues.apache.org/jira/browse/KAFKA-7224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17096139#comment-17096139 ] Maatari commented on KAFKA-7224: Definitely something our team is longing for; there are serious use cases around. This feature would unlock the most critical issue we are facing with our Kafka Streams application: too many intermediary results at this point. We load entire databases and build views that represent the complete entities of the domain through joins. Although functionally things work, operationally there are just too many intermediary results. Having this in combination with [https://cwiki.apache.org/confluence/display/KAFKA/KIP-424%3A+Allow+suppression+of+intermediate+events+based+on+wall+clock+time] would be the killer feature.
[ https://issues.apache.org/jira/browse/KAFKA-7224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17054271#comment-17054271 ] ASF GitHub Bot commented on KAFKA-7224: vvcephei commented on pull request #6428: KAFKA-7224: [WIP] Persistent Suppress URL: https://github.com/apache/kafka/pull/6428
[ https://issues.apache.org/jira/browse/KAFKA-7224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17054270#comment-17054270 ] John Roesler commented on KAFKA-7224: I didn't realize I'd left this ticket in progress. I intended to shelve this work until there was some concrete ask for it. After the implementation in the PR, I ran some benchmarks, and I found that the performance with RocksDB-backed suppression was _absolutely terrible_. I think it was something like two orders of magnitude slower, much slower even than regular RocksDB-backed persistent store operations. The key problem was that the suppression buffer relies on scans, and scans in RocksDB are absurdly slow. I looked into RocksDB optimizations, but didn't find anything remotely promising. It might be the case that you'd be fine with a huge performance penalty in exchange for the "final result" semantics, but it seems like it would have to be a very niche use case: low throughput (so the performance is tolerable) but large amounts of intermediate results (so that the in-memory buffer wouldn't be sufficient). I wasn't confident that such a use case would actually exist, and on the other hand, it felt like a massive potential for frustration to drop such a poor-performing component into the codebase, even if I were to pepper the javadocs with warnings about it. So I decided just to pause work on it pending more information. -John
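The head-of-buffer check John describes can be sketched as follows: records sit in the buffer in stream-time order, and every incoming record must inspect the oldest entries to see whether their suppression interval has elapsed. This is cheap for an in-memory deque but becomes a range scan per record in RocksDB. The sketch ignores per-key deduplication and is purely illustrative (all names are made up):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

// Model of a stream-time suppression buffer: each record times out
// individually, so every add() must check the buffer head. In a
// RocksDB-backed store, that head check is a (slow) ordered scan.
public class SuppressBufferSketch {
    record Entry(long deadline, String key, String value) {}

    private final ArrayDeque<Entry> buffer = new ArrayDeque<>();
    private final long suppressMs;
    private long streamTime = Long.MIN_VALUE;

    public SuppressBufferSketch(long suppressMs) {
        this.suppressMs = suppressMs;
    }

    // Buffers the record and returns any records whose suppression
    // interval has elapsed at the new stream time.
    public List<Entry> add(long timestamp, String key, String value) {
        streamTime = Math.max(streamTime, timestamp);
        buffer.addLast(new Entry(timestamp + suppressMs, key, value));
        List<Entry> emitted = new ArrayList<>();
        // The per-record head check that is fine in memory but slow on disk:
        while (!buffer.isEmpty() && buffer.peekFirst().deadline <= streamTime) {
            emitted.add(buffer.pollFirst());
        }
        return emitted;
    }
}
```

The sketch also shows why records can get "stuck": stream time only advances when a new record arrives, so with no input, nothing is ever emitted, which is the motivation for a wall-clock emit strategy (KIP-424).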
[ https://issues.apache.org/jira/browse/KAFKA-7224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16789906#comment-16789906 ] ASF GitHub Bot commented on KAFKA-7224: vvcephei commented on pull request #6428: KAFKA-7224: [WIP] Persistent Suppress URL: https://github.com/apache/kafka/pull/6428 WIP - no need to review. I'm just getting a copy of this onto github. I'll call for reviews once I think it's ready.