Re: [DISCUSS] KIP-221: Repartition Topic Hints in Streams

Matthias J. Sax Sat, 09 Nov 2019 18:28:36 -0800

Sorry for late reply.

I guess, the question boils down to the intended semantics of
`repartition()`. My understanding is as follows:


- KS does auto-repartitioning for correctness reasons (using the
upstream topic to determine the number of partitions)
- KS does auto-repartitioning only for downstream DSL operators like
`count()` (eg, a `transform()` does never trigger an auto-repartitioning
even if the stream is marked as `repartitioningRequired`).
- KS offers `through()` to enforce a repartitioning -- however, the user
needs to create the topic manually (with the desired number of partitions).

I see two main applications for `repartitioning()`:

1) repartition data before a `transform()` but user does not want to
manage the topic
2) scale out a downstream subtopology

Hence, I see `repartition()` similar to `through()`: if a users calls
it, a repartitining is enforced, with the difference that KS manages the
topic and the user does not need to create it.

This behavior makes (1) and (2) possible.

> I think many users would prefer to just say "if there *is* a repartition
> required at this point in the topology, it should
> have N partitions"

Because of (2), I disagree. Either a user does not care about scaling
out, for which case she would not specify the number of partitions. Or a
user does care, and hence wants to enforce the scale out. I don't think
that any user would say, "maybe scale out".

Therefore, the optimizer should never ignore the repartition operation.
As a "consequence" (because repartitioning is expensive) a user should
make an explicit call to `repartition()` IMHO -- piggybacking an
enforced repartitioning into `groupByKey()` seems to be "dangerous"
because it might be too subtle and an "optional scaling out" as laid out
above does not make sense IMHO.

I am also not worried about "over repartitioning" because the result
stream would never trigger auto-repartitioning. Only if multiple
consecutive calls to `repartition()` are made it could be bad -- but
that's the same with `through()`. In the end, there is always some
responsibility on the user.

Btw, for `.groupBy()` we know that repartitioning will be required,
however, for `groupByKey()` it depends if the KStream is marked as
`repartitioningRequired`.

Hence, for `groupByKey()` it should not be possible for a user to set
number of partitions IMHO. For `groupBy()` it's a different story,
because calling

   `repartition().groupBy()`

does not achieve what we want. Hence, allowing users to pass in the
number of users partitions into `groupBy()` does actually makes sense,
because repartitioning will happen anyway and thus we can piggyback a
scaling decision.

I think that John has a fair concern about the overloads, however, I am
not convinced that using `Grouped` to specify the number of partitions
is intuitive. I double checked `Grouped` and `Repartitioned` and both
allow to specify a `name` and `keySerde/valueSerde`. Thus, I am
wondering if we could bridge the gap between both, if we would make
`Repartitioned extends Grouped`? For this case, we only need
`groupBy(Grouped)` and a user can pass in both types what seems to make
the API quite smooth:

  `stream.groupBy(..., Grouped...)`

  `stream.groupBy(..., Repartitioned...)`


Thoughts?


-Matthias



On 11/7/19 10:59 AM, Levani Kokhreidze wrote:
> Hi Sophie,
> 
> Thank you for your reply, very insightful. Looking forward hearing others 
> opinion as well on this.
> 
> Kind regards,
> Levani
> 
> 
>> On Nov 6, 2019, at 1:30 AM, Sophie Blee-Goldman <[email protected]> wrote:
>>
>>> Personally, I think Matthias’s concern is valid, but on the other hand
>> Kafka Streams has already
>>> optimizer in place which alters topology independently from user
>>
>> I agree (with you) and think this is a good way to put it -- we currently
>> auto-repartition for the user so
>> that they don't have to walk through their entire topology and reason about
>> when and where to place a
>> `.through` (or the new `.repartition`), so why suddenly force this onto the
>> user? How certain are we that
>> users will always get this right? It's easy to imagine that during
>> development, you write your new app with
>> correctly placed repartitions in order to use this new feature. During the
>> course of development you end up
>> tweaking the topology, but don't remember to review or move the
>> repartitioning since you're used to Streams
>> doing this for you. If you use only single-partition topics for testing,
>> you might not even notice your app is
>> spitting out incorrect results!
>>
>> Anyways, I feel pretty strongly that it would be weird to introduce a new
>> feature and say that to use it, you can't take
>> advantage of this other feature anymore. Also, is it possible our
>> optimization framework could ever include an
>> optimized repartitioning strategy that is better than what a user could
>> achieve by manually inserting repartitions?
>> Do we expect users to have a deep understanding of the best way to
>> repartition their particular topology, or is it
>> likely they will end up over-repartitioning either due to missed
>> optimizations or unnecessary extra repartitions?
>> I think many users would prefer to just say "if there *is* a repartition
>> required at this point in the topology, it should
>> have N partitions"
>>
>> As to the idea of adding `numberOfPartitions` to Grouped rather than
>> adding a new parameter to groupBy, that does seem more in line with the
>> current syntax so +1 from me
>>
>> On Tue, Nov 5, 2019 at 2:07 PM Levani Kokhreidze <[email protected]>
>> wrote:
>>
>>> Hello all,
>>>
>>> While https://github.com/apache/kafka/pull/7170 <
>>> https://github.com/apache/kafka/pull/7170> is under review and it’s
>>> almost done, I want to resurrect discussion about this KIP to address
>>> couple of concerns raised by Matthias and John.
>>>
>>> As a reminder, idea of the KIP-221 was to allow DSL users control over
>>> repartitioning and parallelism of sub-topologies by:
>>> 1) Introducing new KStream#repartition operation which is done in
>>> https://github.com/apache/kafka/pull/7170 <
>>> https://github.com/apache/kafka/pull/7170>
>>> 2) Add new KStream#groupBy(Repartitioned) operation, which is planned to
>>> be separate PR.
>>>
>>> While all agree about general implementation and idea behind
>>> https://github.com/apache/kafka/pull/7170 <
>>> https://github.com/apache/kafka/pull/7170> PR, introducing new
>>> KStream#groupBy(Repartitioned) method overload raised some questions during
>>> the review.
>>> Matthias raised concern that there can be cases when user uses
>>> `KStream#groupBy(Repartitioned)` operation, but actual repartitioning may
>>> not required, thus configuration passed via `Repartitioned` would never be
>>> applied (Matthias, please correct me if I misinterpreted your comment).
>>> So instead, if user wants to control parallelism of sub-topologies, he or
>>> she should always use `KStream#repartition` operation before groupBy. Full
>>> comment can be seen here:
>>> https://github.com/apache/kafka/pull/7170#issuecomment-519303125 <
>>> https://github.com/apache/kafka/pull/7170#issuecomment-519303125>
>>>
>>> On the same topic, John pointed out that, from API design perspective, we
>>> shouldn’t intertwine configuration classes of different operators between
>>> one another. So instead of introducing new `KStream#groupBy(Repartitioned)`
>>> for specifying number of partitions for internal topic, we should update
>>> existing `Grouped` class with `numberOfPartitions` field.
>>>
>>> Personally, I think Matthias’s concern is valid, but on the other hand
>>> Kafka Streams has already optimizer in place which alters topology
>>> independently from user. So maybe it makes sense if Kafka Streams,
>>> internally would optimize topology in the best way possible, even if in
>>> some cases this means ignoring some operator configurations passed by the
>>> user. Also, I agree with John about API design semantics. If we go through
>>> with the changes for `KStream#groupBy` operation, it makes more sense to
>>> add `numberOfPartitions` field to `Grouped` class instead of introducing
>>> new `KStream#groupBy(Repartitioned)` method overload.
>>>
>>> I would really appreciate communities feedback on this.
>>>
>>> Kind regards,
>>> Levani
>>>
>>>
>>>
>>>> On Oct 17, 2019, at 12:57 AM, Sophie Blee-Goldman <[email protected]>
>>> wrote:
>>>>
>>>> Hey Levani,
>>>>
>>>> I think people are busy with the upcoming 2.4 release, and don't have
>>> much
>>>> spare time at the
>>>> moment. It's kind of a difficult time to get attention on things, but
>>> feel
>>>> free to pick up something else
>>>> to work on in the meantime until things have calmed down a bit!
>>>>
>>>> Cheers,
>>>> Sophie
>>>>
>>>>
>>>> On Wed, Oct 16, 2019 at 11:26 AM Levani Kokhreidze <
>>> [email protected] <mailto:[email protected]>>
>>>> wrote:
>>>>
>>>>> Hello all,
>>>>>
>>>>> Sorry for bringing this thread again, but I would like to get some
>>>>> attention on this PR: https://github.com/apache/kafka/pull/7170 <
>>> https://github.com/apache/kafka/pull/7170> <
>>>>> https://github.com/apache/kafka/pull/7170 <
>>> https://github.com/apache/kafka/pull/7170>>
>>>>> It's been a while now and I would love to move on to other KIPs as well.
>>>>> Please let me know if you have any concerns.
>>>>>
>>>>> Regards,
>>>>> Levani
>>>>>
>>>>>
>>>>>> On Jul 26, 2019, at 11:25 AM, Levani Kokhreidze <
>>> [email protected] <mailto:[email protected]>>
>>>>> wrote:
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> Here’s voting thread for this KIP:
>>>>> https://www.mail-archive.com/[email protected]/msg99680.html <
>>> https://www.mail-archive.com/[email protected]/msg99680.html> <
>>>>> https://www.mail-archive.com/[email protected]/msg99680.html <
>>> https://www.mail-archive.com/[email protected]/msg99680.html>>
>>>>>>
>>>>>> Regards,
>>>>>> Levani
>>>>>>
>>>>>>> On Jul 24, 2019, at 11:15 PM, Levani Kokhreidze <
>>> [email protected] <mailto:[email protected]>
>>>>> <mailto:[email protected] <mailto:[email protected]>>> wrote:
>>>>>>>
>>>>>>> Hi Matthias,
>>>>>>>
>>>>>>> Thanks for the suggestion. I Don’t have strong opinion on that one.
>>>>>>> Agree that avoiding unnecessary method overloads is a good idea.
>>>>>>>
>>>>>>> Updated KIP
>>>>>>>
>>>>>>> Regards,
>>>>>>> Levani
>>>>>>>
>>>>>>>
>>>>>>>> On Jul 24, 2019, at 8:50 PM, Matthias J. Sax <[email protected]
>>> <mailto:[email protected]>
>>>>> <mailto:[email protected] <mailto:[email protected]>>> wrote:
>>>>>>>>
>>>>>>>> One question:
>>>>>>>>
>>>>>>>> Why do we add
>>>>>>>>
>>>>>>>>> Repartitioned#with(final String name, final int numberOfPartitions)
>>>>>>>>
>>>>>>>> It seems that `#with(String name)`, `#numberOfPartitions(int)` in
>>>>>>>> combination with `withName()` and `withNumberOfPartitions()` should
>>> be
>>>>>>>> sufficient. Users can chain the method calls.
>>>>>>>>
>>>>>>>> (I think it's valuable to keep the number of overload small if
>>>>> possible.)
>>>>>>>>
>>>>>>>> Otherwise LGTM.
>>>>>>>>
>>>>>>>>
>>>>>>>> -Matthias
>>>>>>>>
>>>>>>>>
>>>>>>>> On 7/23/19 2:18 PM, Levani Kokhreidze wrote:
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> Thanks all for your feedback.
>>>>>>>>> I started voting procedure for this KIP. If there’re any other
>>>>> concerns about this KIP, please let me know.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Levani
>>>>>>>>>
>>>>>>>>>> On Jul 20, 2019, at 8:39 PM, Levani Kokhreidze <
>>>>> [email protected] <mailto:[email protected]> <mailto:
>>> [email protected] <mailto:[email protected]>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi Matthias,
>>>>>>>>>>
>>>>>>>>>> Thanks for the suggestion, makes sense.
>>>>>>>>>> I’ve updated KIP (
>>>>>
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>> <
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>
>>>>> <
>>>>>
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>> <
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>
>>>>> <
>>>>>
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>> <
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>
>>>>> <
>>>>>
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>> <
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>
>>>>>>> ).
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Levani
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> On Jul 20, 2019, at 3:53 AM, Matthias J. Sax <
>>> [email protected] <mailto:[email protected]>
>>>>> <mailto:[email protected] <mailto:[email protected]>> <mailto:
>>> [email protected] <mailto:[email protected]> <mailto:
>>>>> [email protected] <mailto:[email protected]>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Thanks for driving the KIP.
>>>>>>>>>>>
>>>>>>>>>>> I agree that users need to be able to specify a partitioning
>>>>> strategy.
>>>>>>>>>>>
>>>>>>>>>>> Sophie raises a fair point about topic configs and producer
>>>>> configs. My
>>>>>>>>>>> take is, that consider `Repartitioned` as an "extension" to
>>>>> `Produced`,
>>>>>>>>>>> that adds topic configuration, is a good way to think about it and
>>>>> helps
>>>>>>>>>>> to keep the API "clean".
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> With regard to method names. I would prefer to avoid
>>> abbreviations.
>>>>> Can
>>>>>>>>>>> we rename:
>>>>>>>>>>>
>>>>>>>>>>> `withNumOfPartitions` -> `withNumberOfPartitions`
>>>>>>>>>>>
>>>>>>>>>>> Furthermore, it might be good to add some more `static` methods:
>>>>>>>>>>>
>>>>>>>>>>> - Repartitioned.with(Serde<K>, Serde<V>)
>>>>>>>>>>> - Repartitioned.withNumberOfPartitions(int)
>>>>>>>>>>> - Repartitioned.streamPartitioner(StreamPartitioner)
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> -Matthias
>>>>>>>>>>>
>>>>>>>>>>> On 7/19/19 3:33 PM, Levani Kokhreidze wrote:
>>>>>>>>>>>> Totally agree. I think in KStream interface it makes sense to
>>> have
>>>>> some duplicate configurations between operators in order to keep API
>>> simple
>>>>> and usable.
>>>>>>>>>>>> Also, as more surface API has, harder it is to have proper
>>>>> backward compatibility.
>>>>>>>>>>>> While initial idea of keeping topic level configs separate was
>>>>> exciting, having Repartitioned class encapsulate some producer level
>>>>> configs makes API more readable.
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Levani
>>>>>>>>>>>>
>>>>>>>>>>>>> On Jul 20, 2019, at 1:15 AM, Sophie Blee-Goldman <
>>>>> [email protected] <mailto:[email protected]> <mailto:
>>> [email protected] <mailto:[email protected]>> <mailto:
>>>>> [email protected] <mailto:[email protected]> <mailto:
>>> [email protected] <mailto:[email protected]>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> I think that is a good point about trying to keep producer level
>>>>>>>>>>>>> configurations and (repartition) topic level considerations
>>>>> separate.
>>>>>>>>>>>>> Number of partitions is definitely purely a topic level
>>>>> configuration. But
>>>>>>>>>>>>> on some level, serdes and partitioners are just as much a topic
>>>>>>>>>>>>> configuration as a producer one. You could have two producers
>>>>> configured
>>>>>>>>>>>>> with different serdes and/or partitioners, but if they are
>>>>> writing to the
>>>>>>>>>>>>> same topic the result would be very difficult to part. So in a
>>>>> sense, these
>>>>>>>>>>>>> are configurations of topics in Streams, not just producers.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Another way to think of it: while the Streams API is not always
>>>>> true to
>>>>>>>>>>>>> this, ideally all the relevant configs for an operator are
>>>>> wrapped into a
>>>>>>>>>>>>> single object (in this case, Repartitioned). We could instead
>>>>> split out the
>>>>>>>>>>>>> fields in common with Produced into a separate parameter to keep
>>>>> topic and
>>>>>>>>>>>>> producer level configurations separate, but this increases the
>>>>> API surface
>>>>>>>>>>>>> area by a lot. It's much more straightforward to just say "this
>>> is
>>>>>>>>>>>>> everything that this particular operator needs" without worrying
>>>>> about what
>>>>>>>>>>>>> exactly you're specifying.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I suppose you could alternatively make Produced a field of
>>>>> Repartitioned,
>>>>>>>>>>>>> but I don't think we do this kind of composition elsewhere in
>>>>> Streams at
>>>>>>>>>>>>> the moment
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Jul 19, 2019 at 1:45 PM Levani Kokhreidze <
>>>>> [email protected] <mailto:[email protected]> <mailto:
>>> [email protected] <mailto:[email protected]>> <mailto:
>>>>> [email protected] <mailto:[email protected]> <mailto:
>>> [email protected] <mailto:[email protected]>>>>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Bill,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks a lot for the feedback.
>>>>>>>>>>>>>> Yes, that makes sense. I’ve updated KIP with
>>>>> `Repartitioned#partitioner`
>>>>>>>>>>>>>> configuration.
>>>>>>>>>>>>>> In the beginning, I wanted to introduce a class for topic level
>>>>>>>>>>>>>> configuration and keep topic level and producer level
>>>>> configurations (such
>>>>>>>>>>>>>> as Produced) separately (see my second email in this thread).
>>>>>>>>>>>>>> But while looking at the semantics of KStream interface, I
>>>>> couldn’t really
>>>>>>>>>>>>>> figure out good operation name for Topic level configuration
>>>>> class and just
>>>>>>>>>>>>>> introducing `Topic` config class was kinda breaking the
>>>>> semantics.
>>>>>>>>>>>>>> So I think having Repartitioned class which encapsulates topic
>>>>> and
>>>>>>>>>>>>>> producer level configurations for internal topics is viable
>>>>> thing to do.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Jul 19, 2019, at 7:47 PM, Bill Bejeck <[email protected]
>>> <mailto:[email protected]>
>>>>> <mailto:[email protected] <mailto:[email protected]>> <mailto:
>>> [email protected] <mailto:[email protected]> <mailto:
>>>>> [email protected] <mailto:[email protected]>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Lavani,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks for resurrecting this KIP.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm also a +1 for adding a partition option.  In addition to
>>>>> the reason
>>>>>>>>>>>>>>> provided by John, my reasoning is:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 1. Users may want to use something other than hash-based
>>>>> partitioning
>>>>>>>>>>>>>>> 2. Users may wish to partition on something different than the
>>>>> key
>>>>>>>>>>>>>>> without having to change the key.  For example:
>>>>>>>>>>>>>>> 1. A combination of fields in the value in conjunction with
>>>>> the key
>>>>>>>>>>>>>>> 2. Something other than the key
>>>>>>>>>>>>>>> 3. We allow users to specify a partitioner on Produced hence
>>> in
>>>>>>>>>>>>>>> KStream.to and KStream.through, so it makes sense for API
>>>>> consistency.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Just my  2 cents.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Bill
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Jul 19, 2019 at 5:46 AM Levani Kokhreidze <
>>>>>>>>>>>>>> [email protected] <mailto:[email protected]>
>>> <mailto:[email protected] <mailto:[email protected]>> <mailto:
>>>>> [email protected] <mailto:[email protected]> <mailto:
>>> [email protected] <mailto:[email protected]>>>>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi John,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> In my mind it makes sense.
>>>>>>>>>>>>>>>> If we add partitioner configuration to Repartitioned class,
>>>>> with the
>>>>>>>>>>>>>>>> combination of specifying number of partitions for internal
>>>>> topics, user
>>>>>>>>>>>>>>>> will have opportunity to ensure co-partitioning before join
>>>>> operation.
>>>>>>>>>>>>>>>> I think this can be quite powerful feature.
>>>>>>>>>>>>>>>> Wondering what others think about this?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Jul 18, 2019, at 1:20 AM, John Roesler <
>>> [email protected] <mailto:[email protected]>
>>>>> <mailto:[email protected] <mailto:[email protected]>> <mailto:
>>> [email protected] <mailto:[email protected]> <mailto:
>>>>> [email protected] <mailto:[email protected]>>>> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Yes, I believe that's what I had in mind. Again, not totally
>>>>> sure it
>>>>>>>>>>>>>>>>> makes sense, but I believe something similar is the
>>> rationale
>>>>> for
>>>>>>>>>>>>>>>>> having the partitioner option in Produced.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>> -John
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 3:20 PM Levani Kokhreidze
>>>>>>>>>>>>>>>>> <[email protected] <mailto:[email protected]>
>>> <mailto:[email protected] <mailto:[email protected]>>
>>>>> <mailto:[email protected] <mailto:[email protected]> <mailto:
>>> [email protected] <mailto:[email protected]>>>> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hey John,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Oh that’s interesting use-case.
>>>>>>>>>>>>>>>>>> Do I understand this correctly, in your example I would
>>>>> first issue
>>>>>>>>>>>>>>>> repartition(Repartitioned) with proper partitioner that
>>>>> essentially
>>>>>>>>>>>>>> would
>>>>>>>>>>>>>>>> be the same as the topic I want to join with and then do the
>>>>>>>>>>>>>> KStream#join
>>>>>>>>>>>>>>>> with DSL?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Jul 17, 2019, at 11:11 PM, John Roesler <
>>>>> [email protected] <mailto:[email protected]> <mailto:[email protected]
>>> <mailto:[email protected]>> <mailto:[email protected] <mailto:
>>> [email protected]>
>>>>> <mailto:[email protected] <mailto:[email protected]>>>>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hey, all, just to chime in,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I think it might be useful to have an option to specify
>>> the
>>>>>>>>>>>>>>>>>>> partitioner. The case I have in mind is that some data may
>>>>> get
>>>>>>>>>>>>>>>>>>> repartitioned and then joined with an input topic. If the
>>>>> right-side
>>>>>>>>>>>>>>>>>>> input topic uses a custom partitioning strategy, then the
>>>>>>>>>>>>>>>>>>> repartitioned stream also needs to be partitioned with the
>>>>> same
>>>>>>>>>>>>>>>>>>> strategy.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Does that make sense, or did I maybe miss something
>>>>> important?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>> -John
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 2:48 PM Levani Kokhreidze
>>>>>>>>>>>>>>>>>>> <[email protected] <mailto:[email protected]>
>>> <mailto:[email protected] <mailto:[email protected]>>
>>>>> <mailto:[email protected] <mailto:[email protected]> <mailto:
>>> [email protected] <mailto:[email protected]>>>> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Yes, I was thinking about it as well. To be honest I’m
>>> not
>>>>> sure
>>>>>>>>>>>>>> about
>>>>>>>>>>>>>>>> it yet.
>>>>>>>>>>>>>>>>>>>> As Kafka Streams DSL user, I don’t really think I would
>>>>> need control
>>>>>>>>>>>>>>>> over partitioner for internal topics.
>>>>>>>>>>>>>>>>>>>> As a user, I would assume that Kafka Streams knows best
>>>>> how to
>>>>>>>>>>>>>>>> partition data for internal topics.
>>>>>>>>>>>>>>>>>>>> In this KIP I wrote that Produced should be used only for
>>>>> topics
>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>> are created by user In advance.
>>>>>>>>>>>>>>>>>>>> In those cases maybe it make sense to have possibility to
>>>>> specify
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>> partitioner.
>>>>>>>>>>>>>>>>>>>> I don’t have clear answer on that yet, but I guess
>>>>> specifying the
>>>>>>>>>>>>>>>> partitioner can be added as well if there’s agreement on
>>> this.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Jul 17, 2019, at 10:42 PM, Sophie Blee-Goldman <
>>>>>>>>>>>>>>>> [email protected] <mailto:[email protected]> <mailto:
>>> [email protected] <mailto:[email protected]>> <mailto:
>>>>> [email protected] <mailto:[email protected]>>> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Thanks for clearing that up. I agree that Repartitioned
>>>>> would be a
>>>>>>>>>>>>>>>> useful
>>>>>>>>>>>>>>>>>>>>> addition. I'm wondering if it might also need to have
>>>>>>>>>>>>>>>>>>>>> a withStreamPartitioner method/field, similar to
>>>>> Produced? I'm not
>>>>>>>>>>>>>>>> sure how
>>>>>>>>>>>>>>>>>>>>> widely this feature is really used, but seems it should
>>> be
>>>>>>>>>>>>>> available
>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>>> repartition topics.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 11:26 AM Levani Kokhreidze <
>>>>>>>>>>>>>>>> [email protected] <mailto:[email protected]>
>>>>> <mailto:[email protected] <mailto:[email protected]> <mailto:
>>> [email protected] <mailto:[email protected]>>>>
>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Hey Sophie,
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> In both cases KStream#repartition and
>>>>>>>>>>>>>>>> KStream#repartition(Repartitioned)
>>>>>>>>>>>>>>>>>>>>>> topic will be created and managed by Kafka Streams.
>>>>>>>>>>>>>>>>>>>>>> Idea of Repartitioned is to give user more control over
>>>>> the topic
>>>>>>>>>>>>>>>> such as
>>>>>>>>>>>>>>>>>>>>>> num of partitions.
>>>>>>>>>>>>>>>>>>>>>> I feel like Repartitioned parameter is something that
>>> is
>>>>> missing
>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>> current DSL design.
>>>>>>>>>>>>>>>>>>>>>> Essentially giving user control over parallelism by
>>>>> configuring
>>>>>>>>>>>>>> num
>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>>> partitions for internal topics.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Hope this answers your question.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On Jul 17, 2019, at 9:02 PM, Sophie Blee-Goldman <
>>>>>>>>>>>>>>>> [email protected] <mailto:[email protected]> <mailto:
>>> [email protected] <mailto:[email protected]>> <mailto:
>>>>> [email protected] <mailto:[email protected]>>>
>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Hey Levani,
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Thanks for the KIP! Can you clarify one thing for me
>>> --
>>>>> for the
>>>>>>>>>>>>>>>>>>>>>>> KStream#repartition signature taking a Repartitioned,
>>>>> will the
>>>>>>>>>>>>>>>> topic be
>>>>>>>>>>>>>>>>>>>>>>> auto-created by Streams (which seems to be the case
>>> for
>>>>> the
>>>>>>>>>>>>>>>> signature
>>>>>>>>>>>>>>>>>>>>>>> without a Repartitioned) or does it have to be
>>>>> pre-created? The
>>>>>>>>>>>>>>>> wording
>>>>>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>>> the KIP makes it seem like one version of the method
>>>>> will
>>>>>>>>>>>>>>>> auto-create
>>>>>>>>>>>>>>>>>>>>>>> topics while the other will not.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>>>>>>> Sophie
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 10:15 AM Levani Kokhreidze <
>>>>>>>>>>>>>>>>>>>>>> [email protected] <mailto:[email protected]>
>>>>> <mailto:[email protected] <mailto:[email protected]> <mailto:
>>> [email protected] <mailto:[email protected]>>>>
>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> One more bump about KIP-221 (
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>> <
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>
>>>>> <
>>>>>
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>> <
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>
>>>>> <
>>>>>
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>> <
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>
>>>>> <
>>>>>
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>> <
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>
>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>> <
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>
>>>>> <
>>>>>
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> )
>>>>>>>>>>>>>>>>>>>>>>>> so it doesn’t get lost in mailing list :)
>>>>>>>>>>>>>>>>>>>>>>>> Would love to hear communities opinions/concerns
>>> about
>>>>> this KIP.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 12, 2019, at 5:27 PM, Levani Kokhreidze <
>>>>>>>>>>>>>>>> [email protected]
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Kind reminder about this KIP:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 9, 2019, at 11:38 AM, Levani Kokhreidze <
>>>>>>>>>>>>>>>>>>>>>> [email protected]
>>>>>>>>>>>>>>>>>>>>>>>> <mailto:[email protected]>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> In order to move this KIP forward, I’ve updated
>>>>> confluence
>>>>>>>>>>>>>> page
>>>>>>>>>>>>>>>> with
>>>>>>>>>>>>>>>>>>>>>>>> the new proposal
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> I’ve also filled “Rejected Alternatives” section.
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Looking forward to discuss this KIP :)
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> King regards,
>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 3, 2019, at 1:08 PM, Levani Kokhreidze <
>>>>>>>>>>>>>>>>>>>>>> [email protected]
>>>>>>>>>>>>>>>>>>>>>>>> <mailto:[email protected]>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello Matthias,
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the feedback and ideas.
>>>>>>>>>>>>>>>>>>>>>>>>>>> I like the idea of introducing dedicated `Topic`
>>>>> class for
>>>>>>>>>>>>>>>> topic
>>>>>>>>>>>>>>>>>>>>>>>> configuration for internal operators like
>>> `groupedBy`.
>>>>>>>>>>>>>>>>>>>>>>>>>>> Would be great to hear others opinion about this
>>> as
>>>>> well.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 3, 2019, at 7:00 AM, Matthias J. Sax <
>>>>>>>>>>>>>>>> [email protected]
>>>>>>>>>>>>>>>>>>>>>>>> <mailto:[email protected]>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for picking up this KIP! And thanks for
>>>>> summarizing
>>>>>>>>>>>>>>>>>>>>>> everything.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Even if some points may have been discussed
>>>>> already (can't
>>>>>>>>>>>>>>>> really
>>>>>>>>>>>>>>>>>>>>>>>>>>>> remember), it's helpful to get a good summary to
>>>>> refresh the
>>>>>>>>>>>>>>>>>>>>>>>> discussion.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> I think your reasoning makes sense. With regard
>>> to
>>>>> the
>>>>>>>>>>>>>>>> distinction
>>>>>>>>>>>>>>>>>>>>>>>>>>>> between operators that manage topics and
>>> operators
>>>>> that use
>>>>>>>>>>>>>>>>>>>>>>>> user-created
>>>>>>>>>>>>>>>>>>>>>>>>>>>> topics: Following this argument, it might
>>> indicate
>>>>> that
>>>>>>>>>>>>>>>> leaving
>>>>>>>>>>>>>>>>>>>>>>>>>>>> `through()` as-is (as an operator that uses
>>>>> use-defined
>>>>>>>>>>>>>>>> topics) and
>>>>>>>>>>>>>>>>>>>>>>>>>>>> introducing a new `repartition()` operator (an
>>>>> operator that
>>>>>>>>>>>>>>>> manages
>>>>>>>>>>>>>>>>>>>>>>>>>>>> topics itself) might be good. Otherwise, there is
>>>>> one
>>>>>>>>>>>>>> operator
>>>>>>>>>>>>>>>>>>>>>>>>>>>> `through()` that sometimes manages topics but
>>>>> sometimes
>>>>>>>>>>>>>> not; a
>>>>>>>>>>>>>>>>>>>>>>>> different
>>>>>>>>>>>>>>>>>>>>>>>>>>>> name, ie, new operator would make the distinction
>>>>> clearer.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> About adding `numOfPartitions` to `Grouped`. I am
>>>>> wondering
>>>>>>>>>>>>>>>> if the
>>>>>>>>>>>>>>>>>>>>>>>> same
>>>>>>>>>>>>>>>>>>>>>>>>>>>> argument as for `Produced` does apply and adding
>>>>> it is
>>>>>>>>>>>>>>>> semantically
>>>>>>>>>>>>>>>>>>>>>>>>>>>> questionable? Might be good to get opinions of
>>>>> others on
>>>>>>>>>>>>>>>> this, too.
>>>>>>>>>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>>>>>>>>> am
>>>>>>>>>>>>>>>>>>>>>>>>>>>> not sure myself what solution I prefer atm.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> So far, KS uses configuration objects that allow
>>> to
>>>>>>>>>>>>>> configure
>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>>>>>>> certain
>>>>>>>>>>>>>>>>>>>>>>>>>>>> "entity" like a consumer, producer, store. If we
>>>>> assume that
>>>>>>>>>>>>>>>> a topic
>>>>>>>>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>>>>>>>>> a similar entity, I am wonder if we should have a
>>>>>>>>>>>>>>>>>>>>>>>>>>>> `Topic#withNumberOfPartitions()` class and method
>>>>> instead of
>>>>>>>>>>>>>>>> a plain
>>>>>>>>>>>>>>>>>>>>>>>>>>>> integer? This would allow us to add other
>>> configs,
>>>>> like
>>>>>>>>>>>>>>>> replication
>>>>>>>>>>>>>>>>>>>>>>>>>>>> factor, retention-time etc, easily, without the
>>>>> need to
>>>>>>>>>>>>>>>> change the
>>>>>>>>>>>>>>>>>>>>>>>> "main
>>>>>>>>>>>>>>>>>>>>>>>>>>>> API".
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Just want to give some ideas. Not sure if I like
>>>>> them
>>>>>>>>>>>>>> myself.
>>>>>>>>>>>>>>>> :)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> On 7/1/19 1:04 AM, Levani Kokhreidze wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Actually, giving it more though - maybe
>>> enhancing
>>>>> Produced
>>>>>>>>>>>>>>>> with num
>>>>>>>>>>>>>>>>>>>>>>>> of partitions configuration is not the best approach.
>>>>> Let me
>>>>>>>>>>>>>>>> explain
>>>>>>>>>>>>>>>>>>>>>> why:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) If we enhance Produced class with this
>>>>> configuration,
>>>>>>>>>>>>>>>> this will
>>>>>>>>>>>>>>>>>>>>>>>> also affect KStream#to operation. Since KStream#to is
>>>>> the final
>>>>>>>>>>>>>>>> sink of
>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>> topology, for me, it seems to be reasonable
>>> assumption
>>>>> that user
>>>>>>>>>>>>>>>> needs
>>>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>>> manually create sink topic in advance. And in that
>>>>> case, having
>>>>>>>>>>>>>>>> num of
>>>>>>>>>>>>>>>>>>>>>>>> partitions configuration doesn’t make much sense.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Looking at Produced class, based on API
>>>>> contract, seems
>>>>>>>>>>>>>>>> like
>>>>>>>>>>>>>>>>>>>>>>>> Produced is designed to be something that is
>>>>> explicitly for
>>>>>>>>>>>>>>>> producer
>>>>>>>>>>>>>>>>>>>>>> (key
>>>>>>>>>>>>>>>>>>>>>>>> serializer, value serializer, partitioner those all
>>>>> are producer
>>>>>>>>>>>>>>>>>>>>>> specific
>>>>>>>>>>>>>>>>>>>>>>>> configurations) and num of partitions is topic level
>>>>>>>>>>>>>>>> configuration. And
>>>>>>>>>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>>>>>>>>> don’t think mixing topic and producer level
>>>>> configurations
>>>>>>>>>>>>>>>> together in
>>>>>>>>>>>>>>>>>>>>>> one
>>>>>>>>>>>>>>>>>>>>>>>> class is the good approach.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) Looking at KStream interface, seems like
>>>>> Produced
>>>>>>>>>>>>>>>> parameter is
>>>>>>>>>>>>>>>>>>>>>>>> for operations that work with non-internal (e.g
>>> topics
>>>>> created
>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>> managed
>>>>>>>>>>>>>>>>>>>>>>>> internally by Kafka Streams) topics and I think we
>>>>> should leave
>>>>>>>>>>>>>>>> it as
>>>>>>>>>>>>>>>>>>>>>> it is
>>>>>>>>>>>>>>>>>>>>>>>> in that case.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Taking all this things into account, I think we
>>>>> should
>>>>>>>>>>>>>>>> distinguish
>>>>>>>>>>>>>>>>>>>>>>>> between DSL operations, where Kafka Streams should
>>>>> create and
>>>>>>>>>>>>>>>> manage
>>>>>>>>>>>>>>>>>>>>>>>> internal topics (KStream#groupBy) vs topics that
>>>>> should be
>>>>>>>>>>>>>>>> created in
>>>>>>>>>>>>>>>>>>>>>>>> advance (e.g KStream#to).
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> To sum it up, I think adding numPartitions
>>>>> configuration in
>>>>>>>>>>>>>>>>>>>>>> Produced
>>>>>>>>>>>>>>>>>>>>>>>> will result in mixing topic and producer level
>>>>> configuration in
>>>>>>>>>>>>>>>> one
>>>>>>>>>>>>>>>>>>>>>> class
>>>>>>>>>>>>>>>>>>>>>>>> and it’s gonna break existing API semantics.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regarding making topic name optional in
>>>>> KStream#through - I
>>>>>>>>>>>>>>>> think
>>>>>>>>>>>>>>>>>>>>>>>> underline idea is very useful and giving users
>>>>> possibility to
>>>>>>>>>>>>>>>> specify
>>>>>>>>>>>>>>>>>>>>>> num
>>>>>>>>>>>>>>>>>>>>>>>> of partitions there is even more useful :)
>>> Considering
>>>>> arguments
>>>>>>>>>>>>>>>> against
>>>>>>>>>>>>>>>>>>>>>>>> adding num of partitions in Produced class, I see two
>>>>> options
>>>>>>>>>>>>>>>> here:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Add following method overloads
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through() - topic will be auto-generated and
>>>>> num of
>>>>>>>>>>>>>>>> partitions
>>>>>>>>>>>>>>>>>>>>>>>> will be taken from source topic
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions) - topic
>>> will
>>>>> be auto
>>>>>>>>>>>>>>>>>>>>>>>> generated with specified num of partitions
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions, final
>>>>> Produced<K, V>
>>>>>>>>>>>>>>>>>>>>>>>> produced) - topic will be with generated with
>>>>> specified num of
>>>>>>>>>>>>>>>>>>>>>> partitions
>>>>>>>>>>>>>>>>>>>>>>>> and configuration taken from produced parameter.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Leave KStream#through as it is and introduce
>>>>> new method
>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>>>>>>>>>> KStream#repartition (I think Matthias suggested this
>>>>> in one of
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>> threads)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Considering all mentioned above I propose the
>>>>> following
>>>>>>>>>>>>>> plan:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Option A:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Leave Produced as it is
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Add num of partitions configuration to
>>> Grouped
>>>>> class (as
>>>>>>>>>>>>>>>>>>>>>>>> mentioned in the KIP)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) Add following method overloads to
>>>>> KStream#through
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through() - topic will be auto-generated and
>>>>> num of
>>>>>>>>>>>>>>>> partitions
>>>>>>>>>>>>>>>>>>>>>>>> will be taken from source topic
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions) - topic
>>> will
>>>>> be auto
>>>>>>>>>>>>>>>>>>>>>>>> generated with specified num of partitions
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions, final
>>>>> Produced<K, V>
>>>>>>>>>>>>>>>>>>>>>>>> produced) - topic will be with generated with
>>>>> specified num of
>>>>>>>>>>>>>>>>>>>>>> partitions
>>>>>>>>>>>>>>>>>>>>>>>> and configuration taken from produced parameter.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Option B:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Leave Produced as it is
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Add num of partitions configuration to
>>> Grouped
>>>>> class (as
>>>>>>>>>>>>>>>>>>>>>>>> mentioned in the KIP)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) Add new operator KStream#repartition for
>>>>> creating and
>>>>>>>>>>>>>>>> managing
>>>>>>>>>>>>>>>>>>>>>>>> internal repartition topics
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> P.S. I’m sorry if all of this was already
>>>>> discussed in the
>>>>>>>>>>>>>>>> mailing
>>>>>>>>>>>>>>>>>>>>>>>> list, but I kinda got with all the threads that were
>>>>> about this
>>>>>>>>>>>>>>>> KIP :(
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 1, 2019, at 9:56 AM, Levani Kokhreidze <
>>>>>>>>>>>>>>>>>>>>>>>> [email protected] <mailto:
>>> [email protected]>>
>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I would like to resurrect discussion around
>>>>> KIP-221. Going
>>>>>>>>>>>>>>>> through
>>>>>>>>>>>>>>>>>>>>>>>> the discussion thread, there’s seems to agreement
>>>>> around
>>>>>>>>>>>>>>>> usefulness of
>>>>>>>>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>>>>>>>> feature.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regarding the implementation, as far as I
>>>>> understood, the
>>>>>>>>>>>>>>>> most
>>>>>>>>>>>>>>>>>>>>>>>> optimal solution for me seems the following:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Add two method overloads to KStream#through
>>>>> method
>>>>>>>>>>>>>>>> (essentially
>>>>>>>>>>>>>>>>>>>>>>>> making topic name optional)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Enhance Produced class with numOfPartitions
>>>>>>>>>>>>>> configuration
>>>>>>>>>>>>>>>>>>>>>> field.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Those two changes will allow DSL users to
>>> control
>>>>>>>>>>>>>>>> parallelism and
>>>>>>>>>>>>>>>>>>>>>>>> trigger re-partition without doing stateful
>>> operations.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I will update KIP with interface changes around
>>>>>>>>>>>>>>>> KStream#through if
>>>>>>>>>>>>>>>>>>>>>>>> this changes sound sensible.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>
>>>
> 
>

signature.asc
Description: OpenPGP digital signature

Re: [DISCUSS] KIP-221: Repartition Topic Hints in Streams

Reply via email to