Re: [DISCUSS] KIP-221: Repartition Topic Hints in Streams

Levani Kokhreidze Wed, 17 Jul 2019 10:15:58 -0700

Hello,

One more bump about KIP-221 
(https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
 
<https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint>)
 so it doesn’t get lost in mailing list :)
Would love to hear communities opinions/concerns about this KIP.


Regards,
Levani


> On Jul 12, 2019, at 5:27 PM, Levani Kokhreidze <levani.co...@gmail.com> wrote:
> 
> Hello,
> 
> Kind reminder about this KIP: 
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>  
> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint>
> 
> Regards,
> Levani
> 
>> On Jul 9, 2019, at 11:38 AM, Levani Kokhreidze <levani.co...@gmail.com 
>> <mailto:levani.co...@gmail.com>> wrote:
>> 
>> Hello,
>> 
>> In order to move this KIP forward, I’ve updated confluence page with the new 
>> proposal 
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>  
>> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint>
>> I’ve also filled “Rejected Alternatives” section. 
>> 
>> Looking forward to discuss this KIP :)
>> 
>> King regards,
>> Levani
>> 
>> 
>>> On Jul 3, 2019, at 1:08 PM, Levani Kokhreidze <levani.co...@gmail.com 
>>> <mailto:levani.co...@gmail.com>> wrote:
>>> 
>>> Hello Matthias,
>>> 
>>> Thanks for the feedback and ideas. 
>>> I like the idea of introducing dedicated `Topic` class for topic 
>>> configuration for internal operators like `groupedBy`.
>>> Would be great to hear others opinion about this as well.
>>> 
>>> Kind regards,
>>> Levani 
>>> 
>>> 
>>>> On Jul 3, 2019, at 7:00 AM, Matthias J. Sax <matth...@confluent.io 
>>>> <mailto:matth...@confluent.io>> wrote:
>>>> 
>>>> Levani,
>>>> 
>>>> Thanks for picking up this KIP! And thanks for summarizing everything.
>>>> Even if some points may have been discussed already (can't really
>>>> remember), it's helpful to get a good summary to refresh the discussion.
>>>> 
>>>> I think your reasoning makes sense. With regard to the distinction
>>>> between operators that manage topics and operators that use user-created
>>>> topics: Following this argument, it might indicate that leaving
>>>> `through()` as-is (as an operator that uses use-defined topics) and
>>>> introducing a new `repartition()` operator (an operator that manages
>>>> topics itself) might be good. Otherwise, there is one operator
>>>> `through()` that sometimes manages topics but sometimes not; a different
>>>> name, ie, new operator would make the distinction clearer.
>>>> 
>>>> About adding `numOfPartitions` to `Grouped`. I am wondering if the same
>>>> argument as for `Produced` does apply and adding it is semantically
>>>> questionable? Might be good to get opinions of others on this, too. I am
>>>> not sure myself what solution I prefer atm.
>>>> 
>>>> So far, KS uses configuration objects that allow to configure a certain
>>>> "entity" like a consumer, producer, store. If we assume that a topic is
>>>> a similar entity, I am wonder if we should have a
>>>> `Topic#withNumberOfPartitions()` class and method instead of a plain
>>>> integer? This would allow us to add other configs, like replication
>>>> factor, retention-time etc, easily, without the need to change the "main
>>>> API".
>>>> 
>>>> Just want to give some ideas. Not sure if I like them myself. :)
>>>> 
>>>> 
>>>> -Matthias
>>>> 
>>>> 
>>>> 
>>>> On 7/1/19 1:04 AM, Levani Kokhreidze wrote:
>>>>> Actually, giving it more though - maybe enhancing Produced with num of 
>>>>> partitions configuration is not the best approach. Let me explain why:
>>>>> 
>>>>> 1) If we enhance Produced class with this configuration, this will also 
>>>>> affect KStream#to operation. Since KStream#to is the final sink of the 
>>>>> topology, for me, it seems to be reasonable assumption that user needs to 
>>>>> manually create sink topic in advance. And in that case, having num of 
>>>>> partitions configuration doesn’t make much sense. 
>>>>> 
>>>>> 2) Looking at Produced class, based on API contract, seems like Produced 
>>>>> is designed to be something that is explicitly for producer (key 
>>>>> serializer, value serializer, partitioner those all are producer specific 
>>>>> configurations) and num of partitions is topic level configuration. And I 
>>>>> don’t think mixing topic and producer level configurations together in 
>>>>> one class is the good approach.
>>>>> 
>>>>> 3) Looking at KStream interface, seems like Produced parameter is for 
>>>>> operations that work with non-internal (e.g topics created and managed 
>>>>> internally by Kafka Streams) topics and I think we should leave it as it 
>>>>> is in that case.
>>>>> 
>>>>> Taking all this things into account, I think we should distinguish 
>>>>> between DSL operations, where Kafka Streams should create and manage 
>>>>> internal topics (KStream#groupBy) vs topics that should be created in 
>>>>> advance (e.g KStream#to).
>>>>> 
>>>>> To sum it up, I think adding numPartitions configuration in Produced will 
>>>>> result in mixing topic and producer level configuration in one class and 
>>>>> it’s gonna break existing API semantics.
>>>>> 
>>>>> Regarding making topic name optional in KStream#through - I think 
>>>>> underline idea is very useful and giving users possibility to specify num 
>>>>> of partitions there is even more useful :) Considering arguments against 
>>>>> adding num of partitions in Produced class, I see two options here:
>>>>> 1) Add following method overloads
>>>>>   * through() - topic will be auto-generated and num of partitions will 
>>>>> be taken from source topic
>>>>>   * through(final int numOfPartitions) - topic will be auto generated 
>>>>> with specified num of partitions
>>>>>   * through(final int numOfPartitions, final Produced<K, V> produced) - 
>>>>> topic will be with generated with specified num of partitions and 
>>>>> configuration taken from produced parameter.
>>>>> 2) Leave KStream#through as it is and introduce new method - 
>>>>> KStream#repartition (I think Matthias suggested this in one of the 
>>>>> threads)
>>>>> 
>>>>> Considering all mentioned above I propose the following plan:
>>>>> 
>>>>> Option A:
>>>>> 1) Leave Produced as it is
>>>>> 2) Add num of partitions configuration to Grouped class (as mentioned in 
>>>>> the KIP)
>>>>> 3) Add following method overloads to KStream#through
>>>>>   * through() - topic will be auto-generated and num of partitions will 
>>>>> be taken from source topic
>>>>>   * through(final int numOfPartitions) - topic will be auto generated 
>>>>> with specified num of partitions
>>>>>   * through(final int numOfPartitions, final Produced<K, V> produced) - 
>>>>> topic will be with generated with specified num of partitions and 
>>>>> configuration taken from produced parameter.
>>>>> 
>>>>> Option B:
>>>>> 1) Leave Produced as it is
>>>>> 2) Add num of partitions configuration to Grouped class (as mentioned in 
>>>>> the KIP)
>>>>> 3) Add new operator KStream#repartition for creating and managing 
>>>>> internal repartition topics
>>>>> 
>>>>> P.S. I’m sorry if all of this was already discussed in the mailing list, 
>>>>> but I kinda got with all the threads that were about this KIP :(
>>>>> 
>>>>> Kind regards,
>>>>> Levani
>>>>> 
>>>>>> On Jul 1, 2019, at 9:56 AM, Levani Kokhreidze <levani.co...@gmail.com 
>>>>>> <mailto:levani.co...@gmail.com>> wrote:
>>>>>> 
>>>>>> Hello,
>>>>>> 
>>>>>> I would like to resurrect discussion around KIP-221. Going through the 
>>>>>> discussion thread, there’s seems to agreement around usefulness of this 
>>>>>> feature. 
>>>>>> Regarding the implementation, as far as I understood, the most optimal 
>>>>>> solution for me seems the following:
>>>>>> 
>>>>>> 1) Add two method overloads to KStream#through method (essentially 
>>>>>> making topic name optional)
>>>>>> 2) Enhance Produced class with numOfPartitions configuration field.
>>>>>> 
>>>>>> Those two changes will allow DSL users to control parallelism and 
>>>>>> trigger re-partition without doing stateful operations.
>>>>>> 
>>>>>> I will update KIP with interface changes around KStream#through if this 
>>>>>> changes sound sensible.
>>>>>> 
>>>>>> Kind regards,
>>>>>> Levani
>>>>> 
>>>>> 
>>>> 
>>> 
>> 
>

Re: [DISCUSS] KIP-221: Repartition Topic Hints in Streams

Reply via email to