Re: [DISCUSS] KIP-221: Repartition Topic Hints in Streams

Levani Kokhreidze Tue, 09 Jul 2019 01:38:51 -0700

Hello,

In order to move this KIP forward, I’ve updated confluence page with the new 
proposal 
https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
 
<https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint>
I’ve also filled “Rejected Alternatives” section.


Looking forward to discuss this KIP :)

King regards,
Levani


> On Jul 3, 2019, at 1:08 PM, Levani Kokhreidze <[email protected]> wrote:
> 
> Hello Matthias,
> 
> Thanks for the feedback and ideas. 
> I like the idea of introducing dedicated `Topic` class for topic 
> configuration for internal operators like `groupedBy`.
> Would be great to hear others opinion about this as well.
> 
> Kind regards,
> Levani 
> 
> 
>> On Jul 3, 2019, at 7:00 AM, Matthias J. Sax <[email protected]> wrote:
>> 
>> Levani,
>> 
>> Thanks for picking up this KIP! And thanks for summarizing everything.
>> Even if some points may have been discussed already (can't really
>> remember), it's helpful to get a good summary to refresh the discussion.
>> 
>> I think your reasoning makes sense. With regard to the distinction
>> between operators that manage topics and operators that use user-created
>> topics: Following this argument, it might indicate that leaving
>> `through()` as-is (as an operator that uses use-defined topics) and
>> introducing a new `repartition()` operator (an operator that manages
>> topics itself) might be good. Otherwise, there is one operator
>> `through()` that sometimes manages topics but sometimes not; a different
>> name, ie, new operator would make the distinction clearer.
>> 
>> About adding `numOfPartitions` to `Grouped`. I am wondering if the same
>> argument as for `Produced` does apply and adding it is semantically
>> questionable? Might be good to get opinions of others on this, too. I am
>> not sure myself what solution I prefer atm.
>> 
>> So far, KS uses configuration objects that allow to configure a certain
>> "entity" like a consumer, producer, store. If we assume that a topic is
>> a similar entity, I am wonder if we should have a
>> `Topic#withNumberOfPartitions()` class and method instead of a plain
>> integer? This would allow us to add other configs, like replication
>> factor, retention-time etc, easily, without the need to change the "main
>> API".
>> 
>> Just want to give some ideas. Not sure if I like them myself. :)
>> 
>> 
>> -Matthias
>> 
>> 
>> 
>> On 7/1/19 1:04 AM, Levani Kokhreidze wrote:
>>> Actually, giving it more though - maybe enhancing Produced with num of 
>>> partitions configuration is not the best approach. Let me explain why:
>>> 
>>> 1) If we enhance Produced class with this configuration, this will also 
>>> affect KStream#to operation. Since KStream#to is the final sink of the 
>>> topology, for me, it seems to be reasonable assumption that user needs to 
>>> manually create sink topic in advance. And in that case, having num of 
>>> partitions configuration doesn’t make much sense. 
>>> 
>>> 2) Looking at Produced class, based on API contract, seems like Produced is 
>>> designed to be something that is explicitly for producer (key serializer, 
>>> value serializer, partitioner those all are producer specific 
>>> configurations) and num of partitions is topic level configuration. And I 
>>> don’t think mixing topic and producer level configurations together in one 
>>> class is the good approach.
>>> 
>>> 3) Looking at KStream interface, seems like Produced parameter is for 
>>> operations that work with non-internal (e.g topics created and managed 
>>> internally by Kafka Streams) topics and I think we should leave it as it is 
>>> in that case.
>>> 
>>> Taking all this things into account, I think we should distinguish between 
>>> DSL operations, where Kafka Streams should create and manage internal 
>>> topics (KStream#groupBy) vs topics that should be created in advance (e.g 
>>> KStream#to).
>>> 
>>> To sum it up, I think adding numPartitions configuration in Produced will 
>>> result in mixing topic and producer level configuration in one class and 
>>> it’s gonna break existing API semantics.
>>> 
>>> Regarding making topic name optional in KStream#through - I think underline 
>>> idea is very useful and giving users possibility to specify num of 
>>> partitions there is even more useful :) Considering arguments against 
>>> adding num of partitions in Produced class, I see two options here:
>>> 1) Add following method overloads
>>>     * through() - topic will be auto-generated and num of partitions will 
>>> be taken from source topic
>>>     * through(final int numOfPartitions) - topic will be auto generated 
>>> with specified num of partitions
>>>     * through(final int numOfPartitions, final Produced<K, V> produced) - 
>>> topic will be with generated with specified num of partitions and 
>>> configuration taken from produced parameter.
>>> 2) Leave KStream#through as it is and introduce new method - 
>>> KStream#repartition (I think Matthias suggested this in one of the threads)
>>> 
>>> Considering all mentioned above I propose the following plan:
>>> 
>>> Option A:
>>> 1) Leave Produced as it is
>>> 2) Add num of partitions configuration to Grouped class (as mentioned in 
>>> the KIP)
>>> 3) Add following method overloads to KStream#through
>>>     * through() - topic will be auto-generated and num of partitions will 
>>> be taken from source topic
>>>     * through(final int numOfPartitions) - topic will be auto generated 
>>> with specified num of partitions
>>>     * through(final int numOfPartitions, final Produced<K, V> produced) - 
>>> topic will be with generated with specified num of partitions and 
>>> configuration taken from produced parameter.
>>> 
>>> Option B:
>>> 1) Leave Produced as it is
>>> 2) Add num of partitions configuration to Grouped class (as mentioned in 
>>> the KIP)
>>> 3) Add new operator KStream#repartition for creating and managing internal 
>>> repartition topics
>>> 
>>> P.S. I’m sorry if all of this was already discussed in the mailing list, 
>>> but I kinda got with all the threads that were about this KIP :(
>>> 
>>> Kind regards,
>>> Levani
>>> 
>>>> On Jul 1, 2019, at 9:56 AM, Levani Kokhreidze <[email protected]> 
>>>> wrote:
>>>> 
>>>> Hello,
>>>> 
>>>> I would like to resurrect discussion around KIP-221. Going through the 
>>>> discussion thread, there’s seems to agreement around usefulness of this 
>>>> feature. 
>>>> Regarding the implementation, as far as I understood, the most optimal 
>>>> solution for me seems the following:
>>>> 
>>>> 1) Add two method overloads to KStream#through method (essentially making 
>>>> topic name optional)
>>>> 2) Enhance Produced class with numOfPartitions configuration field.
>>>> 
>>>> Those two changes will allow DSL users to control parallelism and trigger 
>>>> re-partition without doing stateful operations.
>>>> 
>>>> I will update KIP with interface changes around KStream#through if this 
>>>> changes sound sensible.
>>>> 
>>>> Kind regards,
>>>> Levani
>>> 
>>> 
>> 
>

Re: [DISCUSS] KIP-221: Repartition Topic Hints in Streams

Reply via email to