Levani, Thanks for picking up this KIP! And thanks for summarizing everything. Even if some points may have been discussed already (can't really remember), it's helpful to get a good summary to refresh the discussion.
I think your reasoning makes sense. With regard to the distinction between operators that manage topics and operators that use user-created topics: Following this argument, it might indicate that leaving `through()` as-is (as an operator that uses use-defined topics) and introducing a new `repartition()` operator (an operator that manages topics itself) might be good. Otherwise, there is one operator `through()` that sometimes manages topics but sometimes not; a different name, ie, new operator would make the distinction clearer. About adding `numOfPartitions` to `Grouped`. I am wondering if the same argument as for `Produced` does apply and adding it is semantically questionable? Might be good to get opinions of others on this, too. I am not sure myself what solution I prefer atm. So far, KS uses configuration objects that allow to configure a certain "entity" like a consumer, producer, store. If we assume that a topic is a similar entity, I am wonder if we should have a `Topic#withNumberOfPartitions()` class and method instead of a plain integer? This would allow us to add other configs, like replication factor, retention-time etc, easily, without the need to change the "main API". Just want to give some ideas. Not sure if I like them myself. :) -Matthias On 7/1/19 1:04 AM, Levani Kokhreidze wrote: > Actually, giving it more though - maybe enhancing Produced with num of > partitions configuration is not the best approach. Let me explain why: > > 1) If we enhance Produced class with this configuration, this will also > affect KStream#to operation. Since KStream#to is the final sink of the > topology, for me, it seems to be reasonable assumption that user needs to > manually create sink topic in advance. And in that case, having num of > partitions configuration doesn’t make much sense. > > 2) Looking at Produced class, based on API contract, seems like Produced is > designed to be something that is explicitly for producer (key serializer, > value serializer, partitioner those all are producer specific configurations) > and num of partitions is topic level configuration. And I don’t think mixing > topic and producer level configurations together in one class is the good > approach. > > 3) Looking at KStream interface, seems like Produced parameter is for > operations that work with non-internal (e.g topics created and managed > internally by Kafka Streams) topics and I think we should leave it as it is > in that case. > > Taking all this things into account, I think we should distinguish between > DSL operations, where Kafka Streams should create and manage internal topics > (KStream#groupBy) vs topics that should be created in advance (e.g > KStream#to). > > To sum it up, I think adding numPartitions configuration in Produced will > result in mixing topic and producer level configuration in one class and it’s > gonna break existing API semantics. > > Regarding making topic name optional in KStream#through - I think underline > idea is very useful and giving users possibility to specify num of partitions > there is even more useful :) Considering arguments against adding num of > partitions in Produced class, I see two options here: > 1) Add following method overloads > * through() - topic will be auto-generated and num of partitions will > be taken from source topic > * through(final int numOfPartitions) - topic will be auto generated > with specified num of partitions > * through(final int numOfPartitions, final Produced<K, V> produced) - > topic will be with generated with specified num of partitions and > configuration taken from produced parameter. > 2) Leave KStream#through as it is and introduce new method - > KStream#repartition (I think Matthias suggested this in one of the threads) > > Considering all mentioned above I propose the following plan: > > Option A: > 1) Leave Produced as it is > 2) Add num of partitions configuration to Grouped class (as mentioned in the > KIP) > 3) Add following method overloads to KStream#through > * through() - topic will be auto-generated and num of partitions will > be taken from source topic > * through(final int numOfPartitions) - topic will be auto generated > with specified num of partitions > * through(final int numOfPartitions, final Produced<K, V> produced) - > topic will be with generated with specified num of partitions and > configuration taken from produced parameter. > > Option B: > 1) Leave Produced as it is > 2) Add num of partitions configuration to Grouped class (as mentioned in the > KIP) > 3) Add new operator KStream#repartition for creating and managing internal > repartition topics > > P.S. I’m sorry if all of this was already discussed in the mailing list, but > I kinda got with all the threads that were about this KIP :( > > Kind regards, > Levani > >> On Jul 1, 2019, at 9:56 AM, Levani Kokhreidze <levani.co...@gmail.com> wrote: >> >> Hello, >> >> I would like to resurrect discussion around KIP-221. Going through the >> discussion thread, there’s seems to agreement around usefulness of this >> feature. >> Regarding the implementation, as far as I understood, the most optimal >> solution for me seems the following: >> >> 1) Add two method overloads to KStream#through method (essentially making >> topic name optional) >> 2) Enhance Produced class with numOfPartitions configuration field. >> >> Those two changes will allow DSL users to control parallelism and trigger >> re-partition without doing stateful operations. >> >> I will update KIP with interface changes around KStream#through if this >> changes sound sensible. >> >> Kind regards, >> Levani > >
signature.asc
Description: OpenPGP digital signature