Re: [DISCUSS] KIP-326: Schedulable KTable as Graph source

2019-11-24 Thread Matthias J. Sax
Moved this KIP into status "inactive". Feel free to resume and any time. -Matthias On 7/15/18 6:55 PM, Matthias J. Sax wrote: > I think it would make a lot of sense to provide a simple DSL abstraction. > > Something like: > > KStream stream = ... > KTable count = stream.count(); > > The

Re: [DISCUSS] KIP-326: Schedulable KTable as Graph source

2018-07-15 Thread Matthias J. Sax
I think it would make a lot of sense to provide a simple DSL abstraction. Something like: KStream stream = ... KTable count = stream.count(); The missing groupBy() or grouByKey() class indicates a global counting operation. The JavaDocs should highlight the impact. One open question is, what

Re: [DISCUSS] KIP-326: Schedulable KTable as Graph source

2018-07-06 Thread Guozhang Wang
That's a lot of email exchanges for me to catch up :) My original proposed alternative solution is indeed relying on pre-aggregate before sending to the single-partition topic, so that the traffic on that single-partition topic would not be huge (I called it partial-aggregate but the intent was

Re: [DISCUSS] KIP-326: Schedulable KTable as Graph source

2018-07-05 Thread John Roesler
Ok, I didn't get quite as far as I hoped, and several things are far from ready, but here's what I have so far: https://github.com/apache/kafka/pull/5337 The "unit" test works, and is a good example of how you should expect it to behave:

Re: [DISCUSS] KIP-326: Schedulable KTable as Graph source

2018-07-05 Thread John Roesler
Hey Flávio, Thanks! I haven't got anything usable yet, but I'm working on it now. I'm hoping to push up my branch by the end of the day. I don't know if you've seen it but Streams actually already has something like this, in the form of caching on materialized stores. If you pass in a

Re: [DISCUSS] KIP-326: Schedulable KTable as Graph source

2018-07-04 Thread flaviostutz
John, that was fantastic, man! Have you built any custom implementation of your KIP in your machine so that I could test it out here? I wish I could test it out. If you need any help implementing this feature, please tell me. Thanks. -Flávio Stutz On 2018/07/03 18:04:52, John Roesler

Re: [DISCUSS] KIP-326: Schedulable KTable as Graph source

2018-07-03 Thread John Roesler
Hi Flávio, Thanks! I think that we can actually do this, but the API could be better. I've included Java code below, but I'll copy and modify your example so we're on the same page. EXERCISE 1: - The case is "total counting of events for a huge website" - Tasks from Application A will have

Re: [DISCUSS] KIP-326: Schedulable KTable as Graph source

2018-07-03 Thread flaviostutz
Great feature you have there! I'll try to exercise here how we would achieve the same functional objectives using your KIP: EXERCISE 1: - The case is "total counting of events for a huge website" - Tasks from Application A will have something like: .stream(/site-events)

Re: [DISCUSS] KIP-326: Schedulable KTable as Graph source

2018-07-02 Thread John Roesler
Hi Flávio, Sure thing. And apologies in advance if I missed the point. Below is some more-or-less realistic Java code to demonstrate how, given a high-volume (heavily partitioned) stream of purchases, we can "step down" the update rate with rate-limited intermediate aggregations. Please bear in

Re: [DISCUSS] KIP-326: Schedulable KTable as Graph source

2018-07-02 Thread flaviostutz
Thanks for clarifying the real usage of KIP-328. Now I understood a bit better. I didn't see how that feature would be used to minimize the number of publications to the single partitioned output topic. When it is falls into supression, the graph stops going down? Could you explain better? If

Re: [DISCUSS] KIP-326: Schedulable KTable as Graph source

2018-07-02 Thread John Roesler
Hi Flávio, Thanks for the KIP. I'll apologize that I'm arriving late to the discussion. I've tried to catch up, but I might have missed some nuances. Regarding KIP-328, the idea is to add the ability to suppress intermediate results from all KTables, not just windowed ones. I think this could

Re: [DISCUSS] KIP-326: Schedulable KTable as Graph source

2018-07-01 Thread flaviostutz
For what I understood, that KIP is related to how KStreams will handle KTable updates in Windowed scenarios to optimize resource usage. I couldn't see any specific relation to this KIP. Had you? -Flávio Stutz On 2018/06/29 18:14:46, "Matthias J. Sax" wrote: > Flavio, > > thanks for cleaning

Re: [DISCUSS] KIP-326: Schedulable KTable as Graph source

2018-07-01 Thread flaviostutz
> I agree with Guozhang on comparing the pros and cons of the approach he > outlined vs the one in the proposed KIP. I've just replied him. Please take a look. > Will the triggering mechanism always be time, or would it make sense to > expand to use other mechanisms such as the number of records,

Re: [DISCUSS] KIP-326: Schedulable KTable as Graph source

2018-07-01 Thread flaviostutz
Cons: We tried the "single partition" strategy, but the problem is that for each incoming message to the Graph, we have another output message with the aggregated (cummulative or not) result, so that if we have a million messages/s (among all parallel tasks) being processed, we'll have another

Re: [DISCUSS] KIP-326: Schedulable KTable as Graph source

2018-06-29 Thread Guozhang Wang
Repasting my comment from the other email thread: -- Flávio, thanks for creating this KIP. I think this "single-aggregation" use case is common enough that we should consider how to efficiently supports it: for example, for KSQL that's built on top of Streams, we've seen

Re: [DISCUSS] KIP-326: Schedulable KTable as Graph source

2018-06-29 Thread Bill Bejeck
Hi Flávio, Thanks for creating the KIP. I agree with Guozhang on comparing the pros and cons of the approach he outlined vs the one in the proposed KIP. I also have a few clarification questions on the current KIP Will the triggering mechanism always be time, or would it make sense to expand

Re: [DISCUSS] KIP-326: Schedulable KTable as Graph source

2018-06-29 Thread Matthias J. Sax
Flavio, thanks for cleaning up the KIP number collision. With regard to KIP-328 (https://cwiki.apache.org/confluence/display/KAFKA/KIP-328%3A+Ability+to+suppress+updates+for+KTables) I am wondering how both relate to each other? Any thoughts? -Matthias On 6/29/18 10:23 AM,

Re: [DISCUSS] KIP-326: Schedulable KTable as Graph source

2018-06-29 Thread flaviostutz
Just copying a follow up from another thread to here (sorry about the mess): From: Guozhang Wang Subject: Re: [DISCUSS] KIP-323: Schedulable KTable as Graph source Date: 2018/06/25 22:24:17 List: dev@kafka.apache.org Flávio, thanks for creating this KIP. I think this "single-aggregation" use

Re: [DISCUSS] KIP-326: Schedulable KTable as Graph source

2018-06-29 Thread tiagostutz
Actually this was intended to be registered under the number 323, but someone else catch this number while the proposal was being edited. The correct number and URL of the KIP is: https://cwiki.apache.org/confluence/display/KAFKA/KIP-326%3A+Schedulable+KTable+as+Graph+source On 2018/06/26

Re: [DISCUSS] KIP-326: Schedulable KTable as Graph source

2018-06-26 Thread Ted Yu
What's the relationship between this KIP and KIP-323 ? Thanks On Tue, Jun 26, 2018 at 11:22 AM, Flávio Stutz wrote: > Hey, guys, I've just created a new KIP about creating a new DSL graph > source for realtime partitioned consolidations. > > We have faced the following scenario/problem in a

[DISCUSS] KIP-326: Schedulable KTable as Graph source

2018-06-26 Thread Flávio Stutz
Hey, guys, I've just created a new KIP about creating a new DSL graph source for realtime partitioned consolidations. We have faced the following scenario/problem in a lot of situations with KStreams: - Huge incoming data being processed by numerous application instances - Need to aggregate