Re: [DISCUSS] KIP-221: Repartition Topic Hints in Streams

Matthias J. Sax Sat, 09 Nov 2019 20:38:17 -0800

> it seems like we do want to allow
>> people to optionally specify a partition count as part of this
>> operation, but we don't want that option to _force_ repartitioning


Correct, ie, that is my suggestions.

> "Use P partitions if repartitioning is necessary"

I disagree here, because my reasoning is that:

 - if a user cares about the number of partition, the user wants to
enforce a repartitioning
 - if a user does not case about the number of partitions, we don't need
to provide them a way to pass in a "hint"

Hence, it should be sufficient to support:

// user does not care

  `stream.groupByKey(Grouped)`
  `stream.grouBy(..., Grouped)`

// user does care

  `stream.repartition(Repartitioned).groupByKey()`
  `streams.groupBy(..., Repartitioned)`



-Matthias


On 11/9/19 8:10 PM, John Roesler wrote:
> Thanks for those thoughts, Matthias,
> 
> I find your reasoning about the optimization behavior compelling. The
> `through` operation is very simple and clear to reason about. It just
> passes the data exactly at the specified point in the topology exactly
> through the specified topic. Likewise, if a user invokes a
> `repartition` operator, the simplest behavior is if we just do what
> they asked for.
> 
> Stepping back to think about when optimizations are surprising and
> when they aren't, it occurs to me that we should be free to move
> around repartitions when users have asked to perform some operation
> that implies a repartition, like "change keys, then filter, then
> aggregate". This program requires a repartition, but it could be
> anywhere between the key change and the aggregation. On the other
> hand, if they say, "change keys, then filter, then repartition, then
> aggregate", it seems like they were pretty clear about their desire,
> and we should just take it at face value.
> 
> So, I'm sold on just literally doing a repartition every time they
> invoke the `repartition` operator.
> 
> 
> The "partition count" modifier for `groupBy`/`groupByKey` is more nuanced.
> 
> What you said about `groupByKey` makes sense. If they key hasn't
> actually changed, then we don't need to repartition before
> aggregating. On the other hand, `groupBy` is specifically changing the
> key as part of the grouping operation, so (as you said) we definitely
> have to do a repartition.
> 
> If I'm reading the discussion right, it seems like we do want to allow
> people to optionally specify a partition count as part of this
> operation, but we don't want that option to _force_ repartitioning if
> it's not needed. That last clause is the key. "Use P partitions if
> repartitioning is necessary" is a directive that applies cleanly and
> correctly to both `groupBy` and `groupByKey`. What if we call the
> option `numberOfPartitionsHint`, which along with the "if necessary"
> javadoc, should make it clear that the option won't force a
> repartition, and also gives us enough latitude to still employ the
> optimizer on those repartition topics?
> 
> If we like the idea of expressing it as a "hint" for grouping and a
> "command" for `repartition`, then it seems like it still makes sense
> to keep Grouped and Repartitioned separate, as they would actually
> offer different methods with distinct semantics.
> 
> WDYT?
> 
> Thanks,
> -John
> 
> On Sat, Nov 9, 2019 at 8:28 PM Matthias J. Sax <[email protected]> wrote:
>>
>> Sorry for late reply.
>>
>> I guess, the question boils down to the intended semantics of
>> `repartition()`. My understanding is as follows:
>>
>> - KS does auto-repartitioning for correctness reasons (using the
>> upstream topic to determine the number of partitions)
>> - KS does auto-repartitioning only for downstream DSL operators like
>> `count()` (eg, a `transform()` does never trigger an auto-repartitioning
>> even if the stream is marked as `repartitioningRequired`).
>> - KS offers `through()` to enforce a repartitioning -- however, the user
>> needs to create the topic manually (with the desired number of partitions).
>>
>> I see two main applications for `repartitioning()`:
>>
>> 1) repartition data before a `transform()` but user does not want to
>> manage the topic
>> 2) scale out a downstream subtopology
>>
>> Hence, I see `repartition()` similar to `through()`: if a users calls
>> it, a repartitining is enforced, with the difference that KS manages the
>> topic and the user does not need to create it.
>>
>> This behavior makes (1) and (2) possible.
>>
>>> I think many users would prefer to just say "if there *is* a repartition
>>> required at this point in the topology, it should
>>> have N partitions"
>>
>> Because of (2), I disagree. Either a user does not care about scaling
>> out, for which case she would not specify the number of partitions. Or a
>> user does care, and hence wants to enforce the scale out. I don't think
>> that any user would say, "maybe scale out".
>>
>> Therefore, the optimizer should never ignore the repartition operation.
>> As a "consequence" (because repartitioning is expensive) a user should
>> make an explicit call to `repartition()` IMHO -- piggybacking an
>> enforced repartitioning into `groupByKey()` seems to be "dangerous"
>> because it might be too subtle and an "optional scaling out" as laid out
>> above does not make sense IMHO.
>>
>> I am also not worried about "over repartitioning" because the result
>> stream would never trigger auto-repartitioning. Only if multiple
>> consecutive calls to `repartition()` are made it could be bad -- but
>> that's the same with `through()`. In the end, there is always some
>> responsibility on the user.
>>
>> Btw, for `.groupBy()` we know that repartitioning will be required,
>> however, for `groupByKey()` it depends if the KStream is marked as
>> `repartitioningRequired`.
>>
>> Hence, for `groupByKey()` it should not be possible for a user to set
>> number of partitions IMHO. For `groupBy()` it's a different story,
>> because calling
>>
>>    `repartition().groupBy()`
>>
>> does not achieve what we want. Hence, allowing users to pass in the
>> number of users partitions into `groupBy()` does actually makes sense,
>> because repartitioning will happen anyway and thus we can piggyback a
>> scaling decision.
>>
>> I think that John has a fair concern about the overloads, however, I am
>> not convinced that using `Grouped` to specify the number of partitions
>> is intuitive. I double checked `Grouped` and `Repartitioned` and both
>> allow to specify a `name` and `keySerde/valueSerde`. Thus, I am
>> wondering if we could bridge the gap between both, if we would make
>> `Repartitioned extends Grouped`? For this case, we only need
>> `groupBy(Grouped)` and a user can pass in both types what seems to make
>> the API quite smooth:
>>
>>   `stream.groupBy(..., Grouped...)`
>>
>>   `stream.groupBy(..., Repartitioned...)`
>>
>>
>> Thoughts?
>>
>>
>> -Matthias
>>
>>
>>
>> On 11/7/19 10:59 AM, Levani Kokhreidze wrote:
>>> Hi Sophie,
>>>
>>> Thank you for your reply, very insightful. Looking forward hearing others 
>>> opinion as well on this.
>>>
>>> Kind regards,
>>> Levani
>>>
>>>
>>>> On Nov 6, 2019, at 1:30 AM, Sophie Blee-Goldman <[email protected]> 
>>>> wrote:
>>>>
>>>>> Personally, I think Matthias’s concern is valid, but on the other hand
>>>> Kafka Streams has already
>>>>> optimizer in place which alters topology independently from user
>>>>
>>>> I agree (with you) and think this is a good way to put it -- we currently
>>>> auto-repartition for the user so
>>>> that they don't have to walk through their entire topology and reason about
>>>> when and where to place a
>>>> `.through` (or the new `.repartition`), so why suddenly force this onto the
>>>> user? How certain are we that
>>>> users will always get this right? It's easy to imagine that during
>>>> development, you write your new app with
>>>> correctly placed repartitions in order to use this new feature. During the
>>>> course of development you end up
>>>> tweaking the topology, but don't remember to review or move the
>>>> repartitioning since you're used to Streams
>>>> doing this for you. If you use only single-partition topics for testing,
>>>> you might not even notice your app is
>>>> spitting out incorrect results!
>>>>
>>>> Anyways, I feel pretty strongly that it would be weird to introduce a new
>>>> feature and say that to use it, you can't take
>>>> advantage of this other feature anymore. Also, is it possible our
>>>> optimization framework could ever include an
>>>> optimized repartitioning strategy that is better than what a user could
>>>> achieve by manually inserting repartitions?
>>>> Do we expect users to have a deep understanding of the best way to
>>>> repartition their particular topology, or is it
>>>> likely they will end up over-repartitioning either due to missed
>>>> optimizations or unnecessary extra repartitions?
>>>> I think many users would prefer to just say "if there *is* a repartition
>>>> required at this point in the topology, it should
>>>> have N partitions"
>>>>
>>>> As to the idea of adding `numberOfPartitions` to Grouped rather than
>>>> adding a new parameter to groupBy, that does seem more in line with the
>>>> current syntax so +1 from me
>>>>
>>>> On Tue, Nov 5, 2019 at 2:07 PM Levani Kokhreidze <[email protected]>
>>>> wrote:
>>>>
>>>>> Hello all,
>>>>>
>>>>> While https://github.com/apache/kafka/pull/7170 <
>>>>> https://github.com/apache/kafka/pull/7170> is under review and it’s
>>>>> almost done, I want to resurrect discussion about this KIP to address
>>>>> couple of concerns raised by Matthias and John.
>>>>>
>>>>> As a reminder, idea of the KIP-221 was to allow DSL users control over
>>>>> repartitioning and parallelism of sub-topologies by:
>>>>> 1) Introducing new KStream#repartition operation which is done in
>>>>> https://github.com/apache/kafka/pull/7170 <
>>>>> https://github.com/apache/kafka/pull/7170>
>>>>> 2) Add new KStream#groupBy(Repartitioned) operation, which is planned to
>>>>> be separate PR.
>>>>>
>>>>> While all agree about general implementation and idea behind
>>>>> https://github.com/apache/kafka/pull/7170 <
>>>>> https://github.com/apache/kafka/pull/7170> PR, introducing new
>>>>> KStream#groupBy(Repartitioned) method overload raised some questions 
>>>>> during
>>>>> the review.
>>>>> Matthias raised concern that there can be cases when user uses
>>>>> `KStream#groupBy(Repartitioned)` operation, but actual repartitioning may
>>>>> not required, thus configuration passed via `Repartitioned` would never be
>>>>> applied (Matthias, please correct me if I misinterpreted your comment).
>>>>> So instead, if user wants to control parallelism of sub-topologies, he or
>>>>> she should always use `KStream#repartition` operation before groupBy. Full
>>>>> comment can be seen here:
>>>>> https://github.com/apache/kafka/pull/7170#issuecomment-519303125 <
>>>>> https://github.com/apache/kafka/pull/7170#issuecomment-519303125>
>>>>>
>>>>> On the same topic, John pointed out that, from API design perspective, we
>>>>> shouldn’t intertwine configuration classes of different operators between
>>>>> one another. So instead of introducing new 
>>>>> `KStream#groupBy(Repartitioned)`
>>>>> for specifying number of partitions for internal topic, we should update
>>>>> existing `Grouped` class with `numberOfPartitions` field.
>>>>>
>>>>> Personally, I think Matthias’s concern is valid, but on the other hand
>>>>> Kafka Streams has already optimizer in place which alters topology
>>>>> independently from user. So maybe it makes sense if Kafka Streams,
>>>>> internally would optimize topology in the best way possible, even if in
>>>>> some cases this means ignoring some operator configurations passed by the
>>>>> user. Also, I agree with John about API design semantics. If we go through
>>>>> with the changes for `KStream#groupBy` operation, it makes more sense to
>>>>> add `numberOfPartitions` field to `Grouped` class instead of introducing
>>>>> new `KStream#groupBy(Repartitioned)` method overload.
>>>>>
>>>>> I would really appreciate communities feedback on this.
>>>>>
>>>>> Kind regards,
>>>>> Levani
>>>>>
>>>>>
>>>>>
>>>>>> On Oct 17, 2019, at 12:57 AM, Sophie Blee-Goldman <[email protected]>
>>>>> wrote:
>>>>>>
>>>>>> Hey Levani,
>>>>>>
>>>>>> I think people are busy with the upcoming 2.4 release, and don't have
>>>>> much
>>>>>> spare time at the
>>>>>> moment. It's kind of a difficult time to get attention on things, but
>>>>> feel
>>>>>> free to pick up something else
>>>>>> to work on in the meantime until things have calmed down a bit!
>>>>>>
>>>>>> Cheers,
>>>>>> Sophie
>>>>>>
>>>>>>
>>>>>> On Wed, Oct 16, 2019 at 11:26 AM Levani Kokhreidze <
>>>>> [email protected] <mailto:[email protected]>>
>>>>>> wrote:
>>>>>>
>>>>>>> Hello all,
>>>>>>>
>>>>>>> Sorry for bringing this thread again, but I would like to get some
>>>>>>> attention on this PR: https://github.com/apache/kafka/pull/7170 <
>>>>> https://github.com/apache/kafka/pull/7170> <
>>>>>>> https://github.com/apache/kafka/pull/7170 <
>>>>> https://github.com/apache/kafka/pull/7170>>
>>>>>>> It's been a while now and I would love to move on to other KIPs as well.
>>>>>>> Please let me know if you have any concerns.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Levani
>>>>>>>
>>>>>>>
>>>>>>>> On Jul 26, 2019, at 11:25 AM, Levani Kokhreidze <
>>>>> [email protected] <mailto:[email protected]>>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> Here’s voting thread for this KIP:
>>>>>>> https://www.mail-archive.com/[email protected]/msg99680.html <
>>>>> https://www.mail-archive.com/[email protected]/msg99680.html> <
>>>>>>> https://www.mail-archive.com/[email protected]/msg99680.html <
>>>>> https://www.mail-archive.com/[email protected]/msg99680.html>>
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Levani
>>>>>>>>
>>>>>>>>> On Jul 24, 2019, at 11:15 PM, Levani Kokhreidze <
>>>>> [email protected] <mailto:[email protected]>
>>>>>>> <mailto:[email protected] <mailto:[email protected]>>> wrote:
>>>>>>>>>
>>>>>>>>> Hi Matthias,
>>>>>>>>>
>>>>>>>>> Thanks for the suggestion. I Don’t have strong opinion on that one.
>>>>>>>>> Agree that avoiding unnecessary method overloads is a good idea.
>>>>>>>>>
>>>>>>>>> Updated KIP
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Levani
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> On Jul 24, 2019, at 8:50 PM, Matthias J. Sax <[email protected]
>>>>> <mailto:[email protected]>
>>>>>>> <mailto:[email protected] <mailto:[email protected]>>> wrote:
>>>>>>>>>>
>>>>>>>>>> One question:
>>>>>>>>>>
>>>>>>>>>> Why do we add
>>>>>>>>>>
>>>>>>>>>>> Repartitioned#with(final String name, final int numberOfPartitions)
>>>>>>>>>>
>>>>>>>>>> It seems that `#with(String name)`, `#numberOfPartitions(int)` in
>>>>>>>>>> combination with `withName()` and `withNumberOfPartitions()` should
>>>>> be
>>>>>>>>>> sufficient. Users can chain the method calls.
>>>>>>>>>>
>>>>>>>>>> (I think it's valuable to keep the number of overload small if
>>>>>>> possible.)
>>>>>>>>>>
>>>>>>>>>> Otherwise LGTM.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> -Matthias
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 7/23/19 2:18 PM, Levani Kokhreidze wrote:
>>>>>>>>>>> Hello,
>>>>>>>>>>>
>>>>>>>>>>> Thanks all for your feedback.
>>>>>>>>>>> I started voting procedure for this KIP. If there’re any other
>>>>>>> concerns about this KIP, please let me know.
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Levani
>>>>>>>>>>>
>>>>>>>>>>>> On Jul 20, 2019, at 8:39 PM, Levani Kokhreidze <
>>>>>>> [email protected] <mailto:[email protected]> <mailto:
>>>>> [email protected] <mailto:[email protected]>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Matthias,
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for the suggestion, makes sense.
>>>>>>>>>>>> I’ve updated KIP (
>>>>>>>
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>> <
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>
>>>>>>> <
>>>>>>>
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>> <
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>
>>>>>>> <
>>>>>>>
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>> <
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>
>>>>>>> <
>>>>>>>
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>> <
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>
>>>>>>>>> ).
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Levani
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> On Jul 20, 2019, at 3:53 AM, Matthias J. Sax <
>>>>> [email protected] <mailto:[email protected]>
>>>>>>> <mailto:[email protected] <mailto:[email protected]>> <mailto:
>>>>> [email protected] <mailto:[email protected]> <mailto:
>>>>>>> [email protected] <mailto:[email protected]>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for driving the KIP.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I agree that users need to be able to specify a partitioning
>>>>>>> strategy.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Sophie raises a fair point about topic configs and producer
>>>>>>> configs. My
>>>>>>>>>>>>> take is, that consider `Repartitioned` as an "extension" to
>>>>>>> `Produced`,
>>>>>>>>>>>>> that adds topic configuration, is a good way to think about it and
>>>>>>> helps
>>>>>>>>>>>>> to keep the API "clean".
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> With regard to method names. I would prefer to avoid
>>>>> abbreviations.
>>>>>>> Can
>>>>>>>>>>>>> we rename:
>>>>>>>>>>>>>
>>>>>>>>>>>>> `withNumOfPartitions` -> `withNumberOfPartitions`
>>>>>>>>>>>>>
>>>>>>>>>>>>> Furthermore, it might be good to add some more `static` methods:
>>>>>>>>>>>>>
>>>>>>>>>>>>> - Repartitioned.with(Serde<K>, Serde<V>)
>>>>>>>>>>>>> - Repartitioned.withNumberOfPartitions(int)
>>>>>>>>>>>>> - Repartitioned.streamPartitioner(StreamPartitioner)
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 7/19/19 3:33 PM, Levani Kokhreidze wrote:
>>>>>>>>>>>>>> Totally agree. I think in KStream interface it makes sense to
>>>>> have
>>>>>>> some duplicate configurations between operators in order to keep API
>>>>> simple
>>>>>>> and usable.
>>>>>>>>>>>>>> Also, as more surface API has, harder it is to have proper
>>>>>>> backward compatibility.
>>>>>>>>>>>>>> While initial idea of keeping topic level configs separate was
>>>>>>> exciting, having Repartitioned class encapsulate some producer level
>>>>>>> configs makes API more readable.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Jul 20, 2019, at 1:15 AM, Sophie Blee-Goldman <
>>>>>>> [email protected] <mailto:[email protected]> <mailto:
>>>>> [email protected] <mailto:[email protected]>> <mailto:
>>>>>>> [email protected] <mailto:[email protected]> <mailto:
>>>>> [email protected] <mailto:[email protected]>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I think that is a good point about trying to keep producer level
>>>>>>>>>>>>>>> configurations and (repartition) topic level considerations
>>>>>>> separate.
>>>>>>>>>>>>>>> Number of partitions is definitely purely a topic level
>>>>>>> configuration. But
>>>>>>>>>>>>>>> on some level, serdes and partitioners are just as much a topic
>>>>>>>>>>>>>>> configuration as a producer one. You could have two producers
>>>>>>> configured
>>>>>>>>>>>>>>> with different serdes and/or partitioners, but if they are
>>>>>>> writing to the
>>>>>>>>>>>>>>> same topic the result would be very difficult to part. So in a
>>>>>>> sense, these
>>>>>>>>>>>>>>> are configurations of topics in Streams, not just producers.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Another way to think of it: while the Streams API is not always
>>>>>>> true to
>>>>>>>>>>>>>>> this, ideally all the relevant configs for an operator are
>>>>>>> wrapped into a
>>>>>>>>>>>>>>> single object (in this case, Repartitioned). We could instead
>>>>>>> split out the
>>>>>>>>>>>>>>> fields in common with Produced into a separate parameter to keep
>>>>>>> topic and
>>>>>>>>>>>>>>> producer level configurations separate, but this increases the
>>>>>>> API surface
>>>>>>>>>>>>>>> area by a lot. It's much more straightforward to just say "this
>>>>> is
>>>>>>>>>>>>>>> everything that this particular operator needs" without worrying
>>>>>>> about what
>>>>>>>>>>>>>>> exactly you're specifying.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I suppose you could alternatively make Produced a field of
>>>>>>> Repartitioned,
>>>>>>>>>>>>>>> but I don't think we do this kind of composition elsewhere in
>>>>>>> Streams at
>>>>>>>>>>>>>>> the moment
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Jul 19, 2019 at 1:45 PM Levani Kokhreidze <
>>>>>>> [email protected] <mailto:[email protected]> <mailto:
>>>>> [email protected] <mailto:[email protected]>> <mailto:
>>>>>>> [email protected] <mailto:[email protected]> <mailto:
>>>>> [email protected] <mailto:[email protected]>>>>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Bill,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks a lot for the feedback.
>>>>>>>>>>>>>>>> Yes, that makes sense. I’ve updated KIP with
>>>>>>> `Repartitioned#partitioner`
>>>>>>>>>>>>>>>> configuration.
>>>>>>>>>>>>>>>> In the beginning, I wanted to introduce a class for topic level
>>>>>>>>>>>>>>>> configuration and keep topic level and producer level
>>>>>>> configurations (such
>>>>>>>>>>>>>>>> as Produced) separately (see my second email in this thread).
>>>>>>>>>>>>>>>> But while looking at the semantics of KStream interface, I
>>>>>>> couldn’t really
>>>>>>>>>>>>>>>> figure out good operation name for Topic level configuration
>>>>>>> class and just
>>>>>>>>>>>>>>>> introducing `Topic` config class was kinda breaking the
>>>>>>> semantics.
>>>>>>>>>>>>>>>> So I think having Repartitioned class which encapsulates topic
>>>>>>> and
>>>>>>>>>>>>>>>> producer level configurations for internal topics is viable
>>>>>>> thing to do.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Jul 19, 2019, at 7:47 PM, Bill Bejeck <[email protected]
>>>>> <mailto:[email protected]>
>>>>>>> <mailto:[email protected] <mailto:[email protected]>> <mailto:
>>>>> [email protected] <mailto:[email protected]> <mailto:
>>>>>>> [email protected] <mailto:[email protected]>>>> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Lavani,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks for resurrecting this KIP.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I'm also a +1 for adding a partition option.  In addition to
>>>>>>> the reason
>>>>>>>>>>>>>>>>> provided by John, my reasoning is:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 1. Users may want to use something other than hash-based
>>>>>>> partitioning
>>>>>>>>>>>>>>>>> 2. Users may wish to partition on something different than the
>>>>>>> key
>>>>>>>>>>>>>>>>> without having to change the key.  For example:
>>>>>>>>>>>>>>>>> 1. A combination of fields in the value in conjunction with
>>>>>>> the key
>>>>>>>>>>>>>>>>> 2. Something other than the key
>>>>>>>>>>>>>>>>> 3. We allow users to specify a partitioner on Produced hence
>>>>> in
>>>>>>>>>>>>>>>>> KStream.to and KStream.through, so it makes sense for API
>>>>>>> consistency.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Just my  2 cents.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>> Bill
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Fri, Jul 19, 2019 at 5:46 AM Levani Kokhreidze <
>>>>>>>>>>>>>>>> [email protected] <mailto:[email protected]>
>>>>> <mailto:[email protected] <mailto:[email protected]>> <mailto:
>>>>>>> [email protected] <mailto:[email protected]> <mailto:
>>>>> [email protected] <mailto:[email protected]>>>>
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi John,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> In my mind it makes sense.
>>>>>>>>>>>>>>>>>> If we add partitioner configuration to Repartitioned class,
>>>>>>> with the
>>>>>>>>>>>>>>>>>> combination of specifying number of partitions for internal
>>>>>>> topics, user
>>>>>>>>>>>>>>>>>> will have opportunity to ensure co-partitioning before join
>>>>>>> operation.
>>>>>>>>>>>>>>>>>> I think this can be quite powerful feature.
>>>>>>>>>>>>>>>>>> Wondering what others think about this?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Jul 18, 2019, at 1:20 AM, John Roesler <
>>>>> [email protected] <mailto:[email protected]>
>>>>>>> <mailto:[email protected] <mailto:[email protected]>> <mailto:
>>>>> [email protected] <mailto:[email protected]> <mailto:
>>>>>>> [email protected] <mailto:[email protected]>>>> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Yes, I believe that's what I had in mind. Again, not totally
>>>>>>> sure it
>>>>>>>>>>>>>>>>>>> makes sense, but I believe something similar is the
>>>>> rationale
>>>>>>> for
>>>>>>>>>>>>>>>>>>> having the partitioner option in Produced.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>> -John
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 3:20 PM Levani Kokhreidze
>>>>>>>>>>>>>>>>>>> <[email protected] <mailto:[email protected]>
>>>>> <mailto:[email protected] <mailto:[email protected]>>
>>>>>>> <mailto:[email protected] <mailto:[email protected]> <mailto:
>>>>> [email protected] <mailto:[email protected]>>>> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hey John,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Oh that’s interesting use-case.
>>>>>>>>>>>>>>>>>>>> Do I understand this correctly, in your example I would
>>>>>>> first issue
>>>>>>>>>>>>>>>>>> repartition(Repartitioned) with proper partitioner that
>>>>>>> essentially
>>>>>>>>>>>>>>>> would
>>>>>>>>>>>>>>>>>> be the same as the topic I want to join with and then do the
>>>>>>>>>>>>>>>> KStream#join
>>>>>>>>>>>>>>>>>> with DSL?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Jul 17, 2019, at 11:11 PM, John Roesler <
>>>>>>> [email protected] <mailto:[email protected]> <mailto:[email protected]
>>>>> <mailto:[email protected]>> <mailto:[email protected] <mailto:
>>>>> [email protected]>
>>>>>>> <mailto:[email protected] <mailto:[email protected]>>>>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hey, all, just to chime in,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I think it might be useful to have an option to specify
>>>>> the
>>>>>>>>>>>>>>>>>>>>> partitioner. The case I have in mind is that some data may
>>>>>>> get
>>>>>>>>>>>>>>>>>>>>> repartitioned and then joined with an input topic. If the
>>>>>>> right-side
>>>>>>>>>>>>>>>>>>>>> input topic uses a custom partitioning strategy, then the
>>>>>>>>>>>>>>>>>>>>> repartitioned stream also needs to be partitioned with the
>>>>>>> same
>>>>>>>>>>>>>>>>>>>>> strategy.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Does that make sense, or did I maybe miss something
>>>>>>> important?
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>> -John
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 2:48 PM Levani Kokhreidze
>>>>>>>>>>>>>>>>>>>>> <[email protected] <mailto:[email protected]>
>>>>> <mailto:[email protected] <mailto:[email protected]>>
>>>>>>> <mailto:[email protected] <mailto:[email protected]> <mailto:
>>>>> [email protected] <mailto:[email protected]>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Yes, I was thinking about it as well. To be honest I’m
>>>>> not
>>>>>>> sure
>>>>>>>>>>>>>>>> about
>>>>>>>>>>>>>>>>>> it yet.
>>>>>>>>>>>>>>>>>>>>>> As Kafka Streams DSL user, I don’t really think I would
>>>>>>> need control
>>>>>>>>>>>>>>>>>> over partitioner for internal topics.
>>>>>>>>>>>>>>>>>>>>>> As a user, I would assume that Kafka Streams knows best
>>>>>>> how to
>>>>>>>>>>>>>>>>>> partition data for internal topics.
>>>>>>>>>>>>>>>>>>>>>> In this KIP I wrote that Produced should be used only for
>>>>>>> topics
>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>> are created by user In advance.
>>>>>>>>>>>>>>>>>>>>>> In those cases maybe it make sense to have possibility to
>>>>>>> specify
>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>> partitioner.
>>>>>>>>>>>>>>>>>>>>>> I don’t have clear answer on that yet, but I guess
>>>>>>> specifying the
>>>>>>>>>>>>>>>>>> partitioner can be added as well if there’s agreement on
>>>>> this.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On Jul 17, 2019, at 10:42 PM, Sophie Blee-Goldman <
>>>>>>>>>>>>>>>>>> [email protected] <mailto:[email protected]> <mailto:
>>>>> [email protected] <mailto:[email protected]>> <mailto:
>>>>>>> [email protected] <mailto:[email protected]>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Thanks for clearing that up. I agree that Repartitioned
>>>>>>> would be a
>>>>>>>>>>>>>>>>>> useful
>>>>>>>>>>>>>>>>>>>>>>> addition. I'm wondering if it might also need to have
>>>>>>>>>>>>>>>>>>>>>>> a withStreamPartitioner method/field, similar to
>>>>>>> Produced? I'm not
>>>>>>>>>>>>>>>>>> sure how
>>>>>>>>>>>>>>>>>>>>>>> widely this feature is really used, but seems it should
>>>>> be
>>>>>>>>>>>>>>>> available
>>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>>>>> repartition topics.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 11:26 AM Levani Kokhreidze <
>>>>>>>>>>>>>>>>>> [email protected] <mailto:[email protected]>
>>>>>>> <mailto:[email protected] <mailto:[email protected]> <mailto:
>>>>> [email protected] <mailto:[email protected]>>>>
>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Hey Sophie,
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> In both cases KStream#repartition and
>>>>>>>>>>>>>>>>>> KStream#repartition(Repartitioned)
>>>>>>>>>>>>>>>>>>>>>>>> topic will be created and managed by Kafka Streams.
>>>>>>>>>>>>>>>>>>>>>>>> Idea of Repartitioned is to give user more control over
>>>>>>> the topic
>>>>>>>>>>>>>>>>>> such as
>>>>>>>>>>>>>>>>>>>>>>>> num of partitions.
>>>>>>>>>>>>>>>>>>>>>>>> I feel like Repartitioned parameter is something that
>>>>> is
>>>>>>> missing
>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>>>> current DSL design.
>>>>>>>>>>>>>>>>>>>>>>>> Essentially giving user control over parallelism by
>>>>>>> configuring
>>>>>>>>>>>>>>>> num
>>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>>>>> partitions for internal topics.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Hope this answers your question.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 17, 2019, at 9:02 PM, Sophie Blee-Goldman <
>>>>>>>>>>>>>>>>>> [email protected] <mailto:[email protected]> <mailto:
>>>>> [email protected] <mailto:[email protected]>> <mailto:
>>>>>>> [email protected] <mailto:[email protected]>>>
>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Hey Levani,
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the KIP! Can you clarify one thing for me
>>>>> --
>>>>>>> for the
>>>>>>>>>>>>>>>>>>>>>>>>> KStream#repartition signature taking a Repartitioned,
>>>>>>> will the
>>>>>>>>>>>>>>>>>> topic be
>>>>>>>>>>>>>>>>>>>>>>>>> auto-created by Streams (which seems to be the case
>>>>> for
>>>>>>> the
>>>>>>>>>>>>>>>>>> signature
>>>>>>>>>>>>>>>>>>>>>>>>> without a Repartitioned) or does it have to be
>>>>>>> pre-created? The
>>>>>>>>>>>>>>>>>> wording
>>>>>>>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>>>>> the KIP makes it seem like one version of the method
>>>>>>> will
>>>>>>>>>>>>>>>>>> auto-create
>>>>>>>>>>>>>>>>>>>>>>>>> topics while the other will not.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>>>>>>>>> Sophie
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 10:15 AM Levani Kokhreidze <
>>>>>>>>>>>>>>>>>>>>>>>> [email protected] <mailto:[email protected]>
>>>>>>> <mailto:[email protected] <mailto:[email protected]> <mailto:
>>>>> [email protected] <mailto:[email protected]>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> One more bump about KIP-221 (
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>> <
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>
>>>>>>> <
>>>>>>>
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>> <
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>
>>>>>>> <
>>>>>>>
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>> <
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>
>>>>>>> <
>>>>>>>
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>> <
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>
>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>> <
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>
>>>>>>> <
>>>>>>>
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> )
>>>>>>>>>>>>>>>>>>>>>>>>>> so it doesn’t get lost in mailing list :)
>>>>>>>>>>>>>>>>>>>>>>>>>> Would love to hear communities opinions/concerns
>>>>> about
>>>>>>> this KIP.
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 12, 2019, at 5:27 PM, Levani Kokhreidze <
>>>>>>>>>>>>>>>>>> [email protected]
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind reminder about this KIP:
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 9, 2019, at 11:38 AM, Levani Kokhreidze <
>>>>>>>>>>>>>>>>>>>>>>>> [email protected]
>>>>>>>>>>>>>>>>>>>>>>>>>> <mailto:[email protected]>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> In order to move this KIP forward, I’ve updated
>>>>>>> confluence
>>>>>>>>>>>>>>>> page
>>>>>>>>>>>>>>>>>> with
>>>>>>>>>>>>>>>>>>>>>>>>>> the new proposal
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> I’ve also filled “Rejected Alternatives” section.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Looking forward to discuss this KIP :)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> King regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 3, 2019, at 1:08 PM, Levani Kokhreidze <
>>>>>>>>>>>>>>>>>>>>>>>> [email protected]
>>>>>>>>>>>>>>>>>>>>>>>>>> <mailto:[email protected]>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello Matthias,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the feedback and ideas.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I like the idea of introducing dedicated `Topic`
>>>>>>> class for
>>>>>>>>>>>>>>>>>> topic
>>>>>>>>>>>>>>>>>>>>>>>>>> configuration for internal operators like
>>>>> `groupedBy`.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Would be great to hear others opinion about this
>>>>> as
>>>>>>> well.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 3, 2019, at 7:00 AM, Matthias J. Sax <
>>>>>>>>>>>>>>>>>> [email protected]
>>>>>>>>>>>>>>>>>>>>>>>>>> <mailto:[email protected]>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for picking up this KIP! And thanks for
>>>>>>> summarizing
>>>>>>>>>>>>>>>>>>>>>>>> everything.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Even if some points may have been discussed
>>>>>>> already (can't
>>>>>>>>>>>>>>>>>> really
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> remember), it's helpful to get a good summary to
>>>>>>> refresh the
>>>>>>>>>>>>>>>>>>>>>>>>>> discussion.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I think your reasoning makes sense. With regard
>>>>> to
>>>>>>> the
>>>>>>>>>>>>>>>>>> distinction
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> between operators that manage topics and
>>>>> operators
>>>>>>> that use
>>>>>>>>>>>>>>>>>>>>>>>>>> user-created
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> topics: Following this argument, it might
>>>>> indicate
>>>>>>> that
>>>>>>>>>>>>>>>>>> leaving
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> `through()` as-is (as an operator that uses
>>>>>>> use-defined
>>>>>>>>>>>>>>>>>> topics) and
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> introducing a new `repartition()` operator (an
>>>>>>> operator that
>>>>>>>>>>>>>>>>>> manages
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> topics itself) might be good. Otherwise, there is
>>>>>>> one
>>>>>>>>>>>>>>>> operator
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> `through()` that sometimes manages topics but
>>>>>>> sometimes
>>>>>>>>>>>>>>>> not; a
>>>>>>>>>>>>>>>>>>>>>>>>>> different
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> name, ie, new operator would make the distinction
>>>>>>> clearer.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> About adding `numOfPartitions` to `Grouped`. I am
>>>>>>> wondering
>>>>>>>>>>>>>>>>>> if the
>>>>>>>>>>>>>>>>>>>>>>>>>> same
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> argument as for `Produced` does apply and adding
>>>>>>> it is
>>>>>>>>>>>>>>>>>> semantically
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> questionable? Might be good to get opinions of
>>>>>>> others on
>>>>>>>>>>>>>>>>>> this, too.
>>>>>>>>>>>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>>>>>>>>>>> am
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> not sure myself what solution I prefer atm.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> So far, KS uses configuration objects that allow
>>>>> to
>>>>>>>>>>>>>>>> configure
>>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>>>>>>>>> certain
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> "entity" like a consumer, producer, store. If we
>>>>>>> assume that
>>>>>>>>>>>>>>>>>> a topic
>>>>>>>>>>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> a similar entity, I am wonder if we should have a
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> `Topic#withNumberOfPartitions()` class and method
>>>>>>> instead of
>>>>>>>>>>>>>>>>>> a plain
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> integer? This would allow us to add other
>>>>> configs,
>>>>>>> like
>>>>>>>>>>>>>>>>>> replication
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> factor, retention-time etc, easily, without the
>>>>>>> need to
>>>>>>>>>>>>>>>>>> change the
>>>>>>>>>>>>>>>>>>>>>>>>>> "main
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> API".
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Just want to give some ideas. Not sure if I like
>>>>>>> them
>>>>>>>>>>>>>>>> myself.
>>>>>>>>>>>>>>>>>> :)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On 7/1/19 1:04 AM, Levani Kokhreidze wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Actually, giving it more though - maybe
>>>>> enhancing
>>>>>>> Produced
>>>>>>>>>>>>>>>>>> with num
>>>>>>>>>>>>>>>>>>>>>>>>>> of partitions configuration is not the best approach.
>>>>>>> Let me
>>>>>>>>>>>>>>>>>> explain
>>>>>>>>>>>>>>>>>>>>>>>> why:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) If we enhance Produced class with this
>>>>>>> configuration,
>>>>>>>>>>>>>>>>>> this will
>>>>>>>>>>>>>>>>>>>>>>>>>> also affect KStream#to operation. Since KStream#to is
>>>>>>> the final
>>>>>>>>>>>>>>>>>> sink of
>>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>> topology, for me, it seems to be reasonable
>>>>> assumption
>>>>>>> that user
>>>>>>>>>>>>>>>>>> needs
>>>>>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>>>>> manually create sink topic in advance. And in that
>>>>>>> case, having
>>>>>>>>>>>>>>>>>> num of
>>>>>>>>>>>>>>>>>>>>>>>>>> partitions configuration doesn’t make much sense.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Looking at Produced class, based on API
>>>>>>> contract, seems
>>>>>>>>>>>>>>>>>> like
>>>>>>>>>>>>>>>>>>>>>>>>>> Produced is designed to be something that is
>>>>>>> explicitly for
>>>>>>>>>>>>>>>>>> producer
>>>>>>>>>>>>>>>>>>>>>>>> (key
>>>>>>>>>>>>>>>>>>>>>>>>>> serializer, value serializer, partitioner those all
>>>>>>> are producer
>>>>>>>>>>>>>>>>>>>>>>>> specific
>>>>>>>>>>>>>>>>>>>>>>>>>> configurations) and num of partitions is topic level
>>>>>>>>>>>>>>>>>> configuration. And
>>>>>>>>>>>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>>>>>>>>>>> don’t think mixing topic and producer level
>>>>>>> configurations
>>>>>>>>>>>>>>>>>> together in
>>>>>>>>>>>>>>>>>>>>>>>> one
>>>>>>>>>>>>>>>>>>>>>>>>>> class is the good approach.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) Looking at KStream interface, seems like
>>>>>>> Produced
>>>>>>>>>>>>>>>>>> parameter is
>>>>>>>>>>>>>>>>>>>>>>>>>> for operations that work with non-internal (e.g
>>>>> topics
>>>>>>> created
>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>>> managed
>>>>>>>>>>>>>>>>>>>>>>>>>> internally by Kafka Streams) topics and I think we
>>>>>>> should leave
>>>>>>>>>>>>>>>>>> it as
>>>>>>>>>>>>>>>>>>>>>>>> it is
>>>>>>>>>>>>>>>>>>>>>>>>>> in that case.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Taking all this things into account, I think we
>>>>>>> should
>>>>>>>>>>>>>>>>>> distinguish
>>>>>>>>>>>>>>>>>>>>>>>>>> between DSL operations, where Kafka Streams should
>>>>>>> create and
>>>>>>>>>>>>>>>>>> manage
>>>>>>>>>>>>>>>>>>>>>>>>>> internal topics (KStream#groupBy) vs topics that
>>>>>>> should be
>>>>>>>>>>>>>>>>>> created in
>>>>>>>>>>>>>>>>>>>>>>>>>> advance (e.g KStream#to).
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> To sum it up, I think adding numPartitions
>>>>>>> configuration in
>>>>>>>>>>>>>>>>>>>>>>>> Produced
>>>>>>>>>>>>>>>>>>>>>>>>>> will result in mixing topic and producer level
>>>>>>> configuration in
>>>>>>>>>>>>>>>>>> one
>>>>>>>>>>>>>>>>>>>>>>>> class
>>>>>>>>>>>>>>>>>>>>>>>>>> and it’s gonna break existing API semantics.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regarding making topic name optional in
>>>>>>> KStream#through - I
>>>>>>>>>>>>>>>>>> think
>>>>>>>>>>>>>>>>>>>>>>>>>> underline idea is very useful and giving users
>>>>>>> possibility to
>>>>>>>>>>>>>>>>>> specify
>>>>>>>>>>>>>>>>>>>>>>>> num
>>>>>>>>>>>>>>>>>>>>>>>>>> of partitions there is even more useful :)
>>>>> Considering
>>>>>>> arguments
>>>>>>>>>>>>>>>>>> against
>>>>>>>>>>>>>>>>>>>>>>>>>> adding num of partitions in Produced class, I see two
>>>>>>> options
>>>>>>>>>>>>>>>>>> here:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Add following method overloads
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through() - topic will be auto-generated and
>>>>>>> num of
>>>>>>>>>>>>>>>>>> partitions
>>>>>>>>>>>>>>>>>>>>>>>>>> will be taken from source topic
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions) - topic
>>>>> will
>>>>>>> be auto
>>>>>>>>>>>>>>>>>>>>>>>>>> generated with specified num of partitions
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions, final
>>>>>>> Produced<K, V>
>>>>>>>>>>>>>>>>>>>>>>>>>> produced) - topic will be with generated with
>>>>>>> specified num of
>>>>>>>>>>>>>>>>>>>>>>>> partitions
>>>>>>>>>>>>>>>>>>>>>>>>>> and configuration taken from produced parameter.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Leave KStream#through as it is and introduce
>>>>>>> new method
>>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>>>>>>>>>>>> KStream#repartition (I think Matthias suggested this
>>>>>>> in one of
>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>> threads)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Considering all mentioned above I propose the
>>>>>>> following
>>>>>>>>>>>>>>>> plan:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Option A:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Leave Produced as it is
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Add num of partitions configuration to
>>>>> Grouped
>>>>>>> class (as
>>>>>>>>>>>>>>>>>>>>>>>>>> mentioned in the KIP)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) Add following method overloads to
>>>>>>> KStream#through
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through() - topic will be auto-generated and
>>>>>>> num of
>>>>>>>>>>>>>>>>>> partitions
>>>>>>>>>>>>>>>>>>>>>>>>>> will be taken from source topic
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions) - topic
>>>>> will
>>>>>>> be auto
>>>>>>>>>>>>>>>>>>>>>>>>>> generated with specified num of partitions
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions, final
>>>>>>> Produced<K, V>
>>>>>>>>>>>>>>>>>>>>>>>>>> produced) - topic will be with generated with
>>>>>>> specified num of
>>>>>>>>>>>>>>>>>>>>>>>> partitions
>>>>>>>>>>>>>>>>>>>>>>>>>> and configuration taken from produced parameter.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Option B:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Leave Produced as it is
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Add num of partitions configuration to
>>>>> Grouped
>>>>>>> class (as
>>>>>>>>>>>>>>>>>>>>>>>>>> mentioned in the KIP)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) Add new operator KStream#repartition for
>>>>>>> creating and
>>>>>>>>>>>>>>>>>> managing
>>>>>>>>>>>>>>>>>>>>>>>>>> internal repartition topics
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> P.S. I’m sorry if all of this was already
>>>>>>> discussed in the
>>>>>>>>>>>>>>>>>> mailing
>>>>>>>>>>>>>>>>>>>>>>>>>> list, but I kinda got with all the threads that were
>>>>>>> about this
>>>>>>>>>>>>>>>>>> KIP :(
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 1, 2019, at 9:56 AM, Levani Kokhreidze <
>>>>>>>>>>>>>>>>>>>>>>>>>> [email protected] <mailto:
>>>>> [email protected]>>
>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I would like to resurrect discussion around
>>>>>>> KIP-221. Going
>>>>>>>>>>>>>>>>>> through
>>>>>>>>>>>>>>>>>>>>>>>>>> the discussion thread, there’s seems to agreement
>>>>>>> around
>>>>>>>>>>>>>>>>>> usefulness of
>>>>>>>>>>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>>>>>>>>>> feature.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regarding the implementation, as far as I
>>>>>>> understood, the
>>>>>>>>>>>>>>>>>> most
>>>>>>>>>>>>>>>>>>>>>>>>>> optimal solution for me seems the following:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Add two method overloads to KStream#through
>>>>>>> method
>>>>>>>>>>>>>>>>>> (essentially
>>>>>>>>>>>>>>>>>>>>>>>>>> making topic name optional)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Enhance Produced class with numOfPartitions
>>>>>>>>>>>>>>>> configuration
>>>>>>>>>>>>>>>>>>>>>>>> field.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Those two changes will allow DSL users to
>>>>> control
>>>>>>>>>>>>>>>>>> parallelism and
>>>>>>>>>>>>>>>>>>>>>>>>>> trigger re-partition without doing stateful
>>>>> operations.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I will update KIP with interface changes around
>>>>>>>>>>>>>>>>>> KStream#through if
>>>>>>>>>>>>>>>>>>>>>>>>>> this changes sound sensible.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>
>>>>>
>>>
>>>
>>

signature.asc
Description: OpenPGP digital signature

Re: [DISCUSS] KIP-221: Repartition Topic Hints in Streams

Reply via email to