Re: [DISCUSS] KIP-221: Repartition Topic Hints in Streams

Matthias J. Sax Tue, 02 Jul 2019 21:00:58 -0700

Levani,

Thanks for picking up this KIP! And thanks for summarizing everything.
Even if some points may have been discussed already (can't really
remember), it's helpful to get a good summary to refresh the discussion.


I think your reasoning makes sense. With regard to the distinction
between operators that manage topics and operators that use user-created
topics: Following this argument, it might indicate that leaving
`through()` as-is (as an operator that uses use-defined topics) and
introducing a new `repartition()` operator (an operator that manages
topics itself) might be good. Otherwise, there is one operator
`through()` that sometimes manages topics but sometimes not; a different
name, ie, new operator would make the distinction clearer.

About adding `numOfPartitions` to `Grouped`. I am wondering if the same
argument as for `Produced` does apply and adding it is semantically
questionable? Might be good to get opinions of others on this, too. I am
not sure myself what solution I prefer atm.

So far, KS uses configuration objects that allow to configure a certain
"entity" like a consumer, producer, store. If we assume that a topic is
a similar entity, I am wonder if we should have a
`Topic#withNumberOfPartitions()` class and method instead of a plain
integer? This would allow us to add other configs, like replication
factor, retention-time etc, easily, without the need to change the "main
API".

Just want to give some ideas. Not sure if I like them myself. :)


-Matthias



On 7/1/19 1:04 AM, Levani Kokhreidze wrote:
> Actually, giving it more though - maybe enhancing Produced with num of 
> partitions configuration is not the best approach. Let me explain why:
> 
> 1) If we enhance Produced class with this configuration, this will also 
> affect KStream#to operation. Since KStream#to is the final sink of the 
> topology, for me, it seems to be reasonable assumption that user needs to 
> manually create sink topic in advance. And in that case, having num of 
> partitions configuration doesn’t make much sense. 
> 
> 2) Looking at Produced class, based on API contract, seems like Produced is 
> designed to be something that is explicitly for producer (key serializer, 
> value serializer, partitioner those all are producer specific configurations) 
> and num of partitions is topic level configuration. And I don’t think mixing 
> topic and producer level configurations together in one class is the good 
> approach.
> 
> 3) Looking at KStream interface, seems like Produced parameter is for 
> operations that work with non-internal (e.g topics created and managed 
> internally by Kafka Streams) topics and I think we should leave it as it is 
> in that case.
> 
> Taking all this things into account, I think we should distinguish between 
> DSL operations, where Kafka Streams should create and manage internal topics 
> (KStream#groupBy) vs topics that should be created in advance (e.g 
> KStream#to).
> 
> To sum it up, I think adding numPartitions configuration in Produced will 
> result in mixing topic and producer level configuration in one class and it’s 
> gonna break existing API semantics.
> 
> Regarding making topic name optional in KStream#through - I think underline 
> idea is very useful and giving users possibility to specify num of partitions 
> there is even more useful :) Considering arguments against adding num of 
> partitions in Produced class, I see two options here:
> 1) Add following method overloads
>       * through() - topic will be auto-generated and num of partitions will 
> be taken from source topic
>       * through(final int numOfPartitions) - topic will be auto generated 
> with specified num of partitions
>       * through(final int numOfPartitions, final Produced<K, V> produced) - 
> topic will be with generated with specified num of partitions and 
> configuration taken from produced parameter.
> 2) Leave KStream#through as it is and introduce new method - 
> KStream#repartition (I think Matthias suggested this in one of the threads)
> 
> Considering all mentioned above I propose the following plan:
> 
> Option A:
> 1) Leave Produced as it is
> 2) Add num of partitions configuration to Grouped class (as mentioned in the 
> KIP)
> 3) Add following method overloads to KStream#through
>       * through() - topic will be auto-generated and num of partitions will 
> be taken from source topic
>       * through(final int numOfPartitions) - topic will be auto generated 
> with specified num of partitions
>       * through(final int numOfPartitions, final Produced<K, V> produced) - 
> topic will be with generated with specified num of partitions and 
> configuration taken from produced parameter.
> 
> Option B:
> 1) Leave Produced as it is
> 2) Add num of partitions configuration to Grouped class (as mentioned in the 
> KIP)
> 3) Add new operator KStream#repartition for creating and managing internal 
> repartition topics
> 
> P.S. I’m sorry if all of this was already discussed in the mailing list, but 
> I kinda got with all the threads that were about this KIP :(
> 
> Kind regards,
> Levani
> 
>> On Jul 1, 2019, at 9:56 AM, Levani Kokhreidze <[email protected]> wrote:
>>
>> Hello,
>>
>> I would like to resurrect discussion around KIP-221. Going through the 
>> discussion thread, there’s seems to agreement around usefulness of this 
>> feature. 
>> Regarding the implementation, as far as I understood, the most optimal 
>> solution for me seems the following:
>>
>> 1) Add two method overloads to KStream#through method (essentially making 
>> topic name optional)
>> 2) Enhance Produced class with numOfPartitions configuration field.
>>
>> Those two changes will allow DSL users to control parallelism and trigger 
>> re-partition without doing stateful operations.
>>
>> I will update KIP with interface changes around KStream#through if this 
>> changes sound sensible.
>>
>> Kind regards,
>> Levani
> 
>

signature.asc
Description: OpenPGP digital signature

Re: [DISCUSS] KIP-221: Repartition Topic Hints in Streams

Reply via email to