Re: [DISCUSS] KIP-221: Repartition Topic Hints in Streams

Sophie Blee-Goldman Wed, 17 Jul 2019 12:42:39 -0700

Thanks for clearing that up. I agree that Repartitioned would be a useful
addition. I'm wondering if it might also need to have
a withStreamPartitioner method/field, similar to Produced? I'm not sure how
widely this feature is really used, but seems it should be available for
repartition topics.


On Wed, Jul 17, 2019 at 11:26 AM Levani Kokhreidze <[email protected]>
wrote:

> Hey Sophie,
>
> In both cases KStream#repartition and KStream#repartition(Repartitioned)
> topic will be created and managed by Kafka Streams.
> Idea of Repartitioned is to give user more control over the topic such as
> num of partitions.
> I feel like Repartitioned parameter is something that is missing in
> current DSL design.
> Essentially giving user control over parallelism by configuring num of
> partitions for internal topics.
>
> Hope this answers your question.
>
> Regards,
> Levani
>
> > On Jul 17, 2019, at 9:02 PM, Sophie Blee-Goldman <[email protected]>
> wrote:
> >
> > Hey Levani,
> >
> > Thanks for the KIP! Can you clarify one thing for me -- for the
> > KStream#repartition signature taking a Repartitioned, will the topic be
> > auto-created by Streams (which seems to be the case for the signature
> > without a Repartitioned) or does it have to be pre-created? The wording
> in
> > the KIP makes it seem like one version of the method will auto-create
> > topics while the other will not.
> >
> > Cheers,
> > Sophie
> >
> > On Wed, Jul 17, 2019 at 10:15 AM Levani Kokhreidze <
> [email protected]>
> > wrote:
> >
> >> Hello,
> >>
> >> One more bump about KIP-221 (
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >> <
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >)
> >> so it doesn’t get lost in mailing list :)
> >> Would love to hear communities opinions/concerns about this KIP.
> >>
> >> Regards,
> >> Levani
> >>
> >>
> >>> On Jul 12, 2019, at 5:27 PM, Levani Kokhreidze <[email protected]
> >
> >> wrote:
> >>>
> >>> Hello,
> >>>
> >>> Kind reminder about this KIP:
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >> <
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>
> >>>
> >>> Regards,
> >>> Levani
> >>>
> >>>> On Jul 9, 2019, at 11:38 AM, Levani Kokhreidze <
> [email protected]
> >> <mailto:[email protected]>> wrote:
> >>>>
> >>>> Hello,
> >>>>
> >>>> In order to move this KIP forward, I’ve updated confluence page with
> >> the new proposal
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >> <
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>
> >>>> I’ve also filled “Rejected Alternatives” section.
> >>>>
> >>>> Looking forward to discuss this KIP :)
> >>>>
> >>>> King regards,
> >>>> Levani
> >>>>
> >>>>
> >>>>> On Jul 3, 2019, at 1:08 PM, Levani Kokhreidze <
> [email protected]
> >> <mailto:[email protected]>> wrote:
> >>>>>
> >>>>> Hello Matthias,
> >>>>>
> >>>>> Thanks for the feedback and ideas.
> >>>>> I like the idea of introducing dedicated `Topic` class for topic
> >> configuration for internal operators like `groupedBy`.
> >>>>> Would be great to hear others opinion about this as well.
> >>>>>
> >>>>> Kind regards,
> >>>>> Levani
> >>>>>
> >>>>>
> >>>>>> On Jul 3, 2019, at 7:00 AM, Matthias J. Sax <[email protected]
> >> <mailto:[email protected]>> wrote:
> >>>>>>
> >>>>>> Levani,
> >>>>>>
> >>>>>> Thanks for picking up this KIP! And thanks for summarizing
> everything.
> >>>>>> Even if some points may have been discussed already (can't really
> >>>>>> remember), it's helpful to get a good summary to refresh the
> >> discussion.
> >>>>>>
> >>>>>> I think your reasoning makes sense. With regard to the distinction
> >>>>>> between operators that manage topics and operators that use
> >> user-created
> >>>>>> topics: Following this argument, it might indicate that leaving
> >>>>>> `through()` as-is (as an operator that uses use-defined topics) and
> >>>>>> introducing a new `repartition()` operator (an operator that manages
> >>>>>> topics itself) might be good. Otherwise, there is one operator
> >>>>>> `through()` that sometimes manages topics but sometimes not; a
> >> different
> >>>>>> name, ie, new operator would make the distinction clearer.
> >>>>>>
> >>>>>> About adding `numOfPartitions` to `Grouped`. I am wondering if the
> >> same
> >>>>>> argument as for `Produced` does apply and adding it is semantically
> >>>>>> questionable? Might be good to get opinions of others on this, too.
> I
> >> am
> >>>>>> not sure myself what solution I prefer atm.
> >>>>>>
> >>>>>> So far, KS uses configuration objects that allow to configure a
> >> certain
> >>>>>> "entity" like a consumer, producer, store. If we assume that a topic
> >> is
> >>>>>> a similar entity, I am wonder if we should have a
> >>>>>> `Topic#withNumberOfPartitions()` class and method instead of a plain
> >>>>>> integer? This would allow us to add other configs, like replication
> >>>>>> factor, retention-time etc, easily, without the need to change the
> >> "main
> >>>>>> API".
> >>>>>>
> >>>>>> Just want to give some ideas. Not sure if I like them myself. :)
> >>>>>>
> >>>>>>
> >>>>>> -Matthias
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On 7/1/19 1:04 AM, Levani Kokhreidze wrote:
> >>>>>>> Actually, giving it more though - maybe enhancing Produced with num
> >> of partitions configuration is not the best approach. Let me explain
> why:
> >>>>>>>
> >>>>>>> 1) If we enhance Produced class with this configuration, this will
> >> also affect KStream#to operation. Since KStream#to is the final sink of
> the
> >> topology, for me, it seems to be reasonable assumption that user needs
> to
> >> manually create sink topic in advance. And in that case, having num of
> >> partitions configuration doesn’t make much sense.
> >>>>>>>
> >>>>>>> 2) Looking at Produced class, based on API contract, seems like
> >> Produced is designed to be something that is explicitly for producer
> (key
> >> serializer, value serializer, partitioner those all are producer
> specific
> >> configurations) and num of partitions is topic level configuration. And
> I
> >> don’t think mixing topic and producer level configurations together in
> one
> >> class is the good approach.
> >>>>>>>
> >>>>>>> 3) Looking at KStream interface, seems like Produced parameter is
> >> for operations that work with non-internal (e.g topics created and
> managed
> >> internally by Kafka Streams) topics and I think we should leave it as
> it is
> >> in that case.
> >>>>>>>
> >>>>>>> Taking all this things into account, I think we should distinguish
> >> between DSL operations, where Kafka Streams should create and manage
> >> internal topics (KStream#groupBy) vs topics that should be created in
> >> advance (e.g KStream#to).
> >>>>>>>
> >>>>>>> To sum it up, I think adding numPartitions configuration in
> Produced
> >> will result in mixing topic and producer level configuration in one
> class
> >> and it’s gonna break existing API semantics.
> >>>>>>>
> >>>>>>> Regarding making topic name optional in KStream#through - I think
> >> underline idea is very useful and giving users possibility to specify
> num
> >> of partitions there is even more useful :) Considering arguments against
> >> adding num of partitions in Produced class, I see two options here:
> >>>>>>> 1) Add following method overloads
> >>>>>>>  * through() - topic will be auto-generated and num of partitions
> >> will be taken from source topic
> >>>>>>>  * through(final int numOfPartitions) - topic will be auto
> >> generated with specified num of partitions
> >>>>>>>  * through(final int numOfPartitions, final Produced<K, V>
> >> produced) - topic will be with generated with specified num of
> partitions
> >> and configuration taken from produced parameter.
> >>>>>>> 2) Leave KStream#through as it is and introduce new method -
> >> KStream#repartition (I think Matthias suggested this in one of the
> threads)
> >>>>>>>
> >>>>>>> Considering all mentioned above I propose the following plan:
> >>>>>>>
> >>>>>>> Option A:
> >>>>>>> 1) Leave Produced as it is
> >>>>>>> 2) Add num of partitions configuration to Grouped class (as
> >> mentioned in the KIP)
> >>>>>>> 3) Add following method overloads to KStream#through
> >>>>>>>  * through() - topic will be auto-generated and num of partitions
> >> will be taken from source topic
> >>>>>>>  * through(final int numOfPartitions) - topic will be auto
> >> generated with specified num of partitions
> >>>>>>>  * through(final int numOfPartitions, final Produced<K, V>
> >> produced) - topic will be with generated with specified num of
> partitions
> >> and configuration taken from produced parameter.
> >>>>>>>
> >>>>>>> Option B:
> >>>>>>> 1) Leave Produced as it is
> >>>>>>> 2) Add num of partitions configuration to Grouped class (as
> >> mentioned in the KIP)
> >>>>>>> 3) Add new operator KStream#repartition for creating and managing
> >> internal repartition topics
> >>>>>>>
> >>>>>>> P.S. I’m sorry if all of this was already discussed in the mailing
> >> list, but I kinda got with all the threads that were about this KIP :(
> >>>>>>>
> >>>>>>> Kind regards,
> >>>>>>> Levani
> >>>>>>>
> >>>>>>>> On Jul 1, 2019, at 9:56 AM, Levani Kokhreidze <
> >> [email protected] <mailto:[email protected]>> wrote:
> >>>>>>>>
> >>>>>>>> Hello,
> >>>>>>>>
> >>>>>>>> I would like to resurrect discussion around KIP-221. Going through
> >> the discussion thread, there’s seems to agreement around usefulness of
> this
> >> feature.
> >>>>>>>> Regarding the implementation, as far as I understood, the most
> >> optimal solution for me seems the following:
> >>>>>>>>
> >>>>>>>> 1) Add two method overloads to KStream#through method (essentially
> >> making topic name optional)
> >>>>>>>> 2) Enhance Produced class with numOfPartitions configuration
> field.
> >>>>>>>>
> >>>>>>>> Those two changes will allow DSL users to control parallelism and
> >> trigger re-partition without doing stateful operations.
> >>>>>>>>
> >>>>>>>> I will update KIP with interface changes around KStream#through if
> >> this changes sound sensible.
> >>>>>>>>
> >>>>>>>> Kind regards,
> >>>>>>>> Levani
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> >>
>
>

Re: [DISCUSS] KIP-221: Repartition Topic Hints in Streams

Reply via email to