Re: [DISCUSS] KIP-221: Repartition Topic Hints in Streams

Levani Kokhreidze Wed, 17 Jul 2019 11:26:54 -0700

Hey Sophie,

In both cases KStream#repartition and KStream#repartition(Repartitioned) topic 
will be created and managed by Kafka Streams. 
Idea of Repartitioned is to give user more control over the topic such as num 
of partitions. 
I feel like Repartitioned parameter is something that is missing in current DSL 
design. 
Essentially giving user control over parallelism by configuring num of 
partitions for internal topics.


Hope this answers your question.

Regards,
Levani

> On Jul 17, 2019, at 9:02 PM, Sophie Blee-Goldman <sop...@confluent.io> wrote:
> 
> Hey Levani,
> 
> Thanks for the KIP! Can you clarify one thing for me -- for the
> KStream#repartition signature taking a Repartitioned, will the topic be
> auto-created by Streams (which seems to be the case for the signature
> without a Repartitioned) or does it have to be pre-created? The wording in
> the KIP makes it seem like one version of the method will auto-create
> topics while the other will not.
> 
> Cheers,
> Sophie
> 
> On Wed, Jul 17, 2019 at 10:15 AM Levani Kokhreidze <levani.co...@gmail.com>
> wrote:
> 
>> Hello,
>> 
>> One more bump about KIP-221 (
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>> <
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint>)
>> so it doesn’t get lost in mailing list :)
>> Would love to hear communities opinions/concerns about this KIP.
>> 
>> Regards,
>> Levani
>> 
>> 
>>> On Jul 12, 2019, at 5:27 PM, Levani Kokhreidze <levani.co...@gmail.com>
>> wrote:
>>> 
>>> Hello,
>>> 
>>> Kind reminder about this KIP:
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>> <
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>> 
>>> 
>>> Regards,
>>> Levani
>>> 
>>>> On Jul 9, 2019, at 11:38 AM, Levani Kokhreidze <levani.co...@gmail.com
>> <mailto:levani.co...@gmail.com>> wrote:
>>>> 
>>>> Hello,
>>>> 
>>>> In order to move this KIP forward, I’ve updated confluence page with
>> the new proposal
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>> <
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>> 
>>>> I’ve also filled “Rejected Alternatives” section.
>>>> 
>>>> Looking forward to discuss this KIP :)
>>>> 
>>>> King regards,
>>>> Levani
>>>> 
>>>> 
>>>>> On Jul 3, 2019, at 1:08 PM, Levani Kokhreidze <levani.co...@gmail.com
>> <mailto:levani.co...@gmail.com>> wrote:
>>>>> 
>>>>> Hello Matthias,
>>>>> 
>>>>> Thanks for the feedback and ideas.
>>>>> I like the idea of introducing dedicated `Topic` class for topic
>> configuration for internal operators like `groupedBy`.
>>>>> Would be great to hear others opinion about this as well.
>>>>> 
>>>>> Kind regards,
>>>>> Levani
>>>>> 
>>>>> 
>>>>>> On Jul 3, 2019, at 7:00 AM, Matthias J. Sax <matth...@confluent.io
>> <mailto:matth...@confluent.io>> wrote:
>>>>>> 
>>>>>> Levani,
>>>>>> 
>>>>>> Thanks for picking up this KIP! And thanks for summarizing everything.
>>>>>> Even if some points may have been discussed already (can't really
>>>>>> remember), it's helpful to get a good summary to refresh the
>> discussion.
>>>>>> 
>>>>>> I think your reasoning makes sense. With regard to the distinction
>>>>>> between operators that manage topics and operators that use
>> user-created
>>>>>> topics: Following this argument, it might indicate that leaving
>>>>>> `through()` as-is (as an operator that uses use-defined topics) and
>>>>>> introducing a new `repartition()` operator (an operator that manages
>>>>>> topics itself) might be good. Otherwise, there is one operator
>>>>>> `through()` that sometimes manages topics but sometimes not; a
>> different
>>>>>> name, ie, new operator would make the distinction clearer.
>>>>>> 
>>>>>> About adding `numOfPartitions` to `Grouped`. I am wondering if the
>> same
>>>>>> argument as for `Produced` does apply and adding it is semantically
>>>>>> questionable? Might be good to get opinions of others on this, too. I
>> am
>>>>>> not sure myself what solution I prefer atm.
>>>>>> 
>>>>>> So far, KS uses configuration objects that allow to configure a
>> certain
>>>>>> "entity" like a consumer, producer, store. If we assume that a topic
>> is
>>>>>> a similar entity, I am wonder if we should have a
>>>>>> `Topic#withNumberOfPartitions()` class and method instead of a plain
>>>>>> integer? This would allow us to add other configs, like replication
>>>>>> factor, retention-time etc, easily, without the need to change the
>> "main
>>>>>> API".
>>>>>> 
>>>>>> Just want to give some ideas. Not sure if I like them myself. :)
>>>>>> 
>>>>>> 
>>>>>> -Matthias
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On 7/1/19 1:04 AM, Levani Kokhreidze wrote:
>>>>>>> Actually, giving it more though - maybe enhancing Produced with num
>> of partitions configuration is not the best approach. Let me explain why:
>>>>>>> 
>>>>>>> 1) If we enhance Produced class with this configuration, this will
>> also affect KStream#to operation. Since KStream#to is the final sink of the
>> topology, for me, it seems to be reasonable assumption that user needs to
>> manually create sink topic in advance. And in that case, having num of
>> partitions configuration doesn’t make much sense.
>>>>>>> 
>>>>>>> 2) Looking at Produced class, based on API contract, seems like
>> Produced is designed to be something that is explicitly for producer (key
>> serializer, value serializer, partitioner those all are producer specific
>> configurations) and num of partitions is topic level configuration. And I
>> don’t think mixing topic and producer level configurations together in one
>> class is the good approach.
>>>>>>> 
>>>>>>> 3) Looking at KStream interface, seems like Produced parameter is
>> for operations that work with non-internal (e.g topics created and managed
>> internally by Kafka Streams) topics and I think we should leave it as it is
>> in that case.
>>>>>>> 
>>>>>>> Taking all this things into account, I think we should distinguish
>> between DSL operations, where Kafka Streams should create and manage
>> internal topics (KStream#groupBy) vs topics that should be created in
>> advance (e.g KStream#to).
>>>>>>> 
>>>>>>> To sum it up, I think adding numPartitions configuration in Produced
>> will result in mixing topic and producer level configuration in one class
>> and it’s gonna break existing API semantics.
>>>>>>> 
>>>>>>> Regarding making topic name optional in KStream#through - I think
>> underline idea is very useful and giving users possibility to specify num
>> of partitions there is even more useful :) Considering arguments against
>> adding num of partitions in Produced class, I see two options here:
>>>>>>> 1) Add following method overloads
>>>>>>>  * through() - topic will be auto-generated and num of partitions
>> will be taken from source topic
>>>>>>>  * through(final int numOfPartitions) - topic will be auto
>> generated with specified num of partitions
>>>>>>>  * through(final int numOfPartitions, final Produced<K, V>
>> produced) - topic will be with generated with specified num of partitions
>> and configuration taken from produced parameter.
>>>>>>> 2) Leave KStream#through as it is and introduce new method -
>> KStream#repartition (I think Matthias suggested this in one of the threads)
>>>>>>> 
>>>>>>> Considering all mentioned above I propose the following plan:
>>>>>>> 
>>>>>>> Option A:
>>>>>>> 1) Leave Produced as it is
>>>>>>> 2) Add num of partitions configuration to Grouped class (as
>> mentioned in the KIP)
>>>>>>> 3) Add following method overloads to KStream#through
>>>>>>>  * through() - topic will be auto-generated and num of partitions
>> will be taken from source topic
>>>>>>>  * through(final int numOfPartitions) - topic will be auto
>> generated with specified num of partitions
>>>>>>>  * through(final int numOfPartitions, final Produced<K, V>
>> produced) - topic will be with generated with specified num of partitions
>> and configuration taken from produced parameter.
>>>>>>> 
>>>>>>> Option B:
>>>>>>> 1) Leave Produced as it is
>>>>>>> 2) Add num of partitions configuration to Grouped class (as
>> mentioned in the KIP)
>>>>>>> 3) Add new operator KStream#repartition for creating and managing
>> internal repartition topics
>>>>>>> 
>>>>>>> P.S. I’m sorry if all of this was already discussed in the mailing
>> list, but I kinda got with all the threads that were about this KIP :(
>>>>>>> 
>>>>>>> Kind regards,
>>>>>>> Levani
>>>>>>> 
>>>>>>>> On Jul 1, 2019, at 9:56 AM, Levani Kokhreidze <
>> levani.co...@gmail.com <mailto:levani.co...@gmail.com>> wrote:
>>>>>>>> 
>>>>>>>> Hello,
>>>>>>>> 
>>>>>>>> I would like to resurrect discussion around KIP-221. Going through
>> the discussion thread, there’s seems to agreement around usefulness of this
>> feature.
>>>>>>>> Regarding the implementation, as far as I understood, the most
>> optimal solution for me seems the following:
>>>>>>>> 
>>>>>>>> 1) Add two method overloads to KStream#through method (essentially
>> making topic name optional)
>>>>>>>> 2) Enhance Produced class with numOfPartitions configuration field.
>>>>>>>> 
>>>>>>>> Those two changes will allow DSL users to control parallelism and
>> trigger re-partition without doing stateful operations.
>>>>>>>> 
>>>>>>>> I will update KIP with interface changes around KStream#through if
>> this changes sound sensible.
>>>>>>>> 
>>>>>>>> Kind regards,
>>>>>>>> Levani
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 
>>

Re: [DISCUSS] KIP-221: Repartition Topic Hints in Streams

Reply via email to