Re: [DISCUSS] KIP-221: Repartition Topic Hints in Streams

Levani Kokhreidze Wed, 17 Jul 2019 12:49:18 -0700

Yes, I was thinking about it as well. To be honest I’m not sure about it yet.
As Kafka Streams DSL user, I don’t really think I would need control over 
partitioner for internal topics.
As a user, I would assume that Kafka Streams knows best how to partition data 
for internal topics.
In this KIP I wrote that Produced should be used only for topics that are 
created by user In advance. 
In those cases maybe it make sense to have possibility to specify the 
partitioner.
I don’t have clear answer on that yet, but I guess specifying the partitioner 
can be added as well if there’s agreement on this.


Regards,
Levani

> On Jul 17, 2019, at 10:42 PM, Sophie Blee-Goldman <[email protected]> wrote:
> 
> Thanks for clearing that up. I agree that Repartitioned would be a useful
> addition. I'm wondering if it might also need to have
> a withStreamPartitioner method/field, similar to Produced? I'm not sure how
> widely this feature is really used, but seems it should be available for
> repartition topics.
> 
> On Wed, Jul 17, 2019 at 11:26 AM Levani Kokhreidze <[email protected]>
> wrote:
> 
>> Hey Sophie,
>> 
>> In both cases KStream#repartition and KStream#repartition(Repartitioned)
>> topic will be created and managed by Kafka Streams.
>> Idea of Repartitioned is to give user more control over the topic such as
>> num of partitions.
>> I feel like Repartitioned parameter is something that is missing in
>> current DSL design.
>> Essentially giving user control over parallelism by configuring num of
>> partitions for internal topics.
>> 
>> Hope this answers your question.
>> 
>> Regards,
>> Levani
>> 
>>> On Jul 17, 2019, at 9:02 PM, Sophie Blee-Goldman <[email protected]>
>> wrote:
>>> 
>>> Hey Levani,
>>> 
>>> Thanks for the KIP! Can you clarify one thing for me -- for the
>>> KStream#repartition signature taking a Repartitioned, will the topic be
>>> auto-created by Streams (which seems to be the case for the signature
>>> without a Repartitioned) or does it have to be pre-created? The wording
>> in
>>> the KIP makes it seem like one version of the method will auto-create
>>> topics while the other will not.
>>> 
>>> Cheers,
>>> Sophie
>>> 
>>> On Wed, Jul 17, 2019 at 10:15 AM Levani Kokhreidze <
>> [email protected]>
>>> wrote:
>>> 
>>>> Hello,
>>>> 
>>>> One more bump about KIP-221 (
>>>> 
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>> <
>>>> 
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>> )
>>>> so it doesn’t get lost in mailing list :)
>>>> Would love to hear communities opinions/concerns about this KIP.
>>>> 
>>>> Regards,
>>>> Levani
>>>> 
>>>> 
>>>>> On Jul 12, 2019, at 5:27 PM, Levani Kokhreidze <[email protected]
>>> 
>>>> wrote:
>>>>> 
>>>>> Hello,
>>>>> 
>>>>> Kind reminder about this KIP:
>>>> 
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>> <
>>>> 
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>> 
>>>>> 
>>>>> Regards,
>>>>> Levani
>>>>> 
>>>>>> On Jul 9, 2019, at 11:38 AM, Levani Kokhreidze <
>> [email protected]
>>>> <mailto:[email protected]>> wrote:
>>>>>> 
>>>>>> Hello,
>>>>>> 
>>>>>> In order to move this KIP forward, I’ve updated confluence page with
>>>> the new proposal
>>>> 
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>> <
>>>> 
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>> 
>>>>>> I’ve also filled “Rejected Alternatives” section.
>>>>>> 
>>>>>> Looking forward to discuss this KIP :)
>>>>>> 
>>>>>> King regards,
>>>>>> Levani
>>>>>> 
>>>>>> 
>>>>>>> On Jul 3, 2019, at 1:08 PM, Levani Kokhreidze <
>> [email protected]
>>>> <mailto:[email protected]>> wrote:
>>>>>>> 
>>>>>>> Hello Matthias,
>>>>>>> 
>>>>>>> Thanks for the feedback and ideas.
>>>>>>> I like the idea of introducing dedicated `Topic` class for topic
>>>> configuration for internal operators like `groupedBy`.
>>>>>>> Would be great to hear others opinion about this as well.
>>>>>>> 
>>>>>>> Kind regards,
>>>>>>> Levani
>>>>>>> 
>>>>>>> 
>>>>>>>> On Jul 3, 2019, at 7:00 AM, Matthias J. Sax <[email protected]
>>>> <mailto:[email protected]>> wrote:
>>>>>>>> 
>>>>>>>> Levani,
>>>>>>>> 
>>>>>>>> Thanks for picking up this KIP! And thanks for summarizing
>> everything.
>>>>>>>> Even if some points may have been discussed already (can't really
>>>>>>>> remember), it's helpful to get a good summary to refresh the
>>>> discussion.
>>>>>>>> 
>>>>>>>> I think your reasoning makes sense. With regard to the distinction
>>>>>>>> between operators that manage topics and operators that use
>>>> user-created
>>>>>>>> topics: Following this argument, it might indicate that leaving
>>>>>>>> `through()` as-is (as an operator that uses use-defined topics) and
>>>>>>>> introducing a new `repartition()` operator (an operator that manages
>>>>>>>> topics itself) might be good. Otherwise, there is one operator
>>>>>>>> `through()` that sometimes manages topics but sometimes not; a
>>>> different
>>>>>>>> name, ie, new operator would make the distinction clearer.
>>>>>>>> 
>>>>>>>> About adding `numOfPartitions` to `Grouped`. I am wondering if the
>>>> same
>>>>>>>> argument as for `Produced` does apply and adding it is semantically
>>>>>>>> questionable? Might be good to get opinions of others on this, too.
>> I
>>>> am
>>>>>>>> not sure myself what solution I prefer atm.
>>>>>>>> 
>>>>>>>> So far, KS uses configuration objects that allow to configure a
>>>> certain
>>>>>>>> "entity" like a consumer, producer, store. If we assume that a topic
>>>> is
>>>>>>>> a similar entity, I am wonder if we should have a
>>>>>>>> `Topic#withNumberOfPartitions()` class and method instead of a plain
>>>>>>>> integer? This would allow us to add other configs, like replication
>>>>>>>> factor, retention-time etc, easily, without the need to change the
>>>> "main
>>>>>>>> API".
>>>>>>>> 
>>>>>>>> Just want to give some ideas. Not sure if I like them myself. :)
>>>>>>>> 
>>>>>>>> 
>>>>>>>> -Matthias
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On 7/1/19 1:04 AM, Levani Kokhreidze wrote:
>>>>>>>>> Actually, giving it more though - maybe enhancing Produced with num
>>>> of partitions configuration is not the best approach. Let me explain
>> why:
>>>>>>>>> 
>>>>>>>>> 1) If we enhance Produced class with this configuration, this will
>>>> also affect KStream#to operation. Since KStream#to is the final sink of
>> the
>>>> topology, for me, it seems to be reasonable assumption that user needs
>> to
>>>> manually create sink topic in advance. And in that case, having num of
>>>> partitions configuration doesn’t make much sense.
>>>>>>>>> 
>>>>>>>>> 2) Looking at Produced class, based on API contract, seems like
>>>> Produced is designed to be something that is explicitly for producer
>> (key
>>>> serializer, value serializer, partitioner those all are producer
>> specific
>>>> configurations) and num of partitions is topic level configuration. And
>> I
>>>> don’t think mixing topic and producer level configurations together in
>> one
>>>> class is the good approach.
>>>>>>>>> 
>>>>>>>>> 3) Looking at KStream interface, seems like Produced parameter is
>>>> for operations that work with non-internal (e.g topics created and
>> managed
>>>> internally by Kafka Streams) topics and I think we should leave it as
>> it is
>>>> in that case.
>>>>>>>>> 
>>>>>>>>> Taking all this things into account, I think we should distinguish
>>>> between DSL operations, where Kafka Streams should create and manage
>>>> internal topics (KStream#groupBy) vs topics that should be created in
>>>> advance (e.g KStream#to).
>>>>>>>>> 
>>>>>>>>> To sum it up, I think adding numPartitions configuration in
>> Produced
>>>> will result in mixing topic and producer level configuration in one
>> class
>>>> and it’s gonna break existing API semantics.
>>>>>>>>> 
>>>>>>>>> Regarding making topic name optional in KStream#through - I think
>>>> underline idea is very useful and giving users possibility to specify
>> num
>>>> of partitions there is even more useful :) Considering arguments against
>>>> adding num of partitions in Produced class, I see two options here:
>>>>>>>>> 1) Add following method overloads
>>>>>>>>> * through() - topic will be auto-generated and num of partitions
>>>> will be taken from source topic
>>>>>>>>> * through(final int numOfPartitions) - topic will be auto
>>>> generated with specified num of partitions
>>>>>>>>> * through(final int numOfPartitions, final Produced<K, V>
>>>> produced) - topic will be with generated with specified num of
>> partitions
>>>> and configuration taken from produced parameter.
>>>>>>>>> 2) Leave KStream#through as it is and introduce new method -
>>>> KStream#repartition (I think Matthias suggested this in one of the
>> threads)
>>>>>>>>> 
>>>>>>>>> Considering all mentioned above I propose the following plan:
>>>>>>>>> 
>>>>>>>>> Option A:
>>>>>>>>> 1) Leave Produced as it is
>>>>>>>>> 2) Add num of partitions configuration to Grouped class (as
>>>> mentioned in the KIP)
>>>>>>>>> 3) Add following method overloads to KStream#through
>>>>>>>>> * through() - topic will be auto-generated and num of partitions
>>>> will be taken from source topic
>>>>>>>>> * through(final int numOfPartitions) - topic will be auto
>>>> generated with specified num of partitions
>>>>>>>>> * through(final int numOfPartitions, final Produced<K, V>
>>>> produced) - topic will be with generated with specified num of
>> partitions
>>>> and configuration taken from produced parameter.
>>>>>>>>> 
>>>>>>>>> Option B:
>>>>>>>>> 1) Leave Produced as it is
>>>>>>>>> 2) Add num of partitions configuration to Grouped class (as
>>>> mentioned in the KIP)
>>>>>>>>> 3) Add new operator KStream#repartition for creating and managing
>>>> internal repartition topics
>>>>>>>>> 
>>>>>>>>> P.S. I’m sorry if all of this was already discussed in the mailing
>>>> list, but I kinda got with all the threads that were about this KIP :(
>>>>>>>>> 
>>>>>>>>> Kind regards,
>>>>>>>>> Levani
>>>>>>>>> 
>>>>>>>>>> On Jul 1, 2019, at 9:56 AM, Levani Kokhreidze <
>>>> [email protected] <mailto:[email protected]>> wrote:
>>>>>>>>>> 
>>>>>>>>>> Hello,
>>>>>>>>>> 
>>>>>>>>>> I would like to resurrect discussion around KIP-221. Going through
>>>> the discussion thread, there’s seems to agreement around usefulness of
>> this
>>>> feature.
>>>>>>>>>> Regarding the implementation, as far as I understood, the most
>>>> optimal solution for me seems the following:
>>>>>>>>>> 
>>>>>>>>>> 1) Add two method overloads to KStream#through method (essentially
>>>> making topic name optional)
>>>>>>>>>> 2) Enhance Produced class with numOfPartitions configuration
>> field.
>>>>>>>>>> 
>>>>>>>>>> Those two changes will allow DSL users to control parallelism and
>>>> trigger re-partition without doing stateful operations.
>>>>>>>>>> 
>>>>>>>>>> I will update KIP with interface changes around KStream#through if
>>>> this changes sound sensible.
>>>>>>>>>> 
>>>>>>>>>> Kind regards,
>>>>>>>>>> Levani
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>>> 
>> 
>>

Re: [DISCUSS] KIP-221: Repartition Topic Hints in Streams

Reply via email to