Re: [DISCUSS] KIP-221: Repartition Topic Hints in Streams

Levani Kokhreidze Tue, 23 Jul 2019 14:19:07 -0700

Hello,

Thanks all for your feedback.
I started voting procedure for this KIP. If there’re any other concerns about 
this KIP, please let me know.


Regards,
Levani

> On Jul 20, 2019, at 8:39 PM, Levani Kokhreidze <levani.co...@gmail.com> wrote:
> 
> Hi Matthias,
> 
> Thanks for the suggestion, makes sense.
> I’ve updated KIP 
> (https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>  
> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint>).
> 
> Regards,
> Levani
> 
> 
>> On Jul 20, 2019, at 3:53 AM, Matthias J. Sax <matth...@confluent.io 
>> <mailto:matth...@confluent.io>> wrote:
>> 
>> Thanks for driving the KIP.
>> 
>> I agree that users need to be able to specify a partitioning strategy.
>> 
>> Sophie raises a fair point about topic configs and producer configs. My
>> take is, that consider `Repartitioned` as an "extension" to `Produced`,
>> that adds topic configuration, is a good way to think about it and helps
>> to keep the API "clean".
>> 
>> 
>> With regard to method names. I would prefer to avoid abbreviations. Can
>> we rename:
>> 
>> `withNumOfPartitions` -> `withNumberOfPartitions`
>> 
>> Furthermore, it might be good to add some more `static` methods:
>> 
>> - Repartitioned.with(Serde<K>, Serde<V>)
>> - Repartitioned.withNumberOfPartitions(int)
>> - Repartitioned.streamPartitioner(StreamPartitioner)
>> 
>> 
>> -Matthias
>> 
>> On 7/19/19 3:33 PM, Levani Kokhreidze wrote:
>>> Totally agree. I think in KStream interface it makes sense to have some 
>>> duplicate configurations between operators in order to keep API simple and 
>>> usable.
>>> Also, as more surface API has, harder it is to have proper backward 
>>> compatibility.
>>> While initial idea of keeping topic level configs separate was exciting, 
>>> having Repartitioned class encapsulate some producer level configs makes 
>>> API more readable.
>>> 
>>> Regards,
>>> Levani
>>> 
>>>> On Jul 20, 2019, at 1:15 AM, Sophie Blee-Goldman <sop...@confluent.io 
>>>> <mailto:sop...@confluent.io>> wrote:
>>>> 
>>>> I think that is a good point about trying to keep producer level
>>>> configurations and (repartition) topic level considerations separate.
>>>> Number of partitions is definitely purely a topic level configuration. But
>>>> on some level, serdes and partitioners are just as much a topic
>>>> configuration as a producer one. You could have two producers configured
>>>> with different serdes and/or partitioners, but if they are writing to the
>>>> same topic the result would be very difficult to part. So in a sense, these
>>>> are configurations of topics in Streams, not just producers.
>>>> 
>>>> Another way to think of it: while the Streams API is not always true to
>>>> this, ideally all the relevant configs for an operator are wrapped into a
>>>> single object (in this case, Repartitioned). We could instead split out the
>>>> fields in common with Produced into a separate parameter to keep topic and
>>>> producer level configurations separate, but this increases the API surface
>>>> area by a lot. It's much more straightforward to just say "this is
>>>> everything that this particular operator needs" without worrying about what
>>>> exactly you're specifying.
>>>> 
>>>> I suppose you could alternatively make Produced a field of Repartitioned,
>>>> but I don't think we do this kind of composition elsewhere in Streams at
>>>> the moment
>>>> 
>>>> On Fri, Jul 19, 2019 at 1:45 PM Levani Kokhreidze <levani.co...@gmail.com 
>>>> <mailto:levani.co...@gmail.com>>
>>>> wrote:
>>>> 
>>>>> Hi Bill,
>>>>> 
>>>>> Thanks a lot for the feedback.
>>>>> Yes, that makes sense. I’ve updated KIP with `Repartitioned#partitioner`
>>>>> configuration.
>>>>> In the beginning, I wanted to introduce a class for topic level
>>>>> configuration and keep topic level and producer level configurations (such
>>>>> as Produced) separately (see my second email in this thread).
>>>>> But while looking at the semantics of KStream interface, I couldn’t really
>>>>> figure out good operation name for Topic level configuration class and 
>>>>> just
>>>>> introducing `Topic` config class was kinda breaking the semantics.
>>>>> So I think having Repartitioned class which encapsulates topic and
>>>>> producer level configurations for internal topics is viable thing to do.
>>>>> 
>>>>> Regards,
>>>>> Levani
>>>>> 
>>>>>> On Jul 19, 2019, at 7:47 PM, Bill Bejeck <bbej...@gmail.com 
>>>>>> <mailto:bbej...@gmail.com>> wrote:
>>>>>> 
>>>>>> Hi Lavani,
>>>>>> 
>>>>>> Thanks for resurrecting this KIP.
>>>>>> 
>>>>>> I'm also a +1 for adding a partition option.  In addition to the reason
>>>>>> provided by John, my reasoning is:
>>>>>> 
>>>>>> 1. Users may want to use something other than hash-based partitioning
>>>>>> 2. Users may wish to partition on something different than the key
>>>>>> without having to change the key.  For example:
>>>>>>    1. A combination of fields in the value in conjunction with the key
>>>>>>    2. Something other than the key
>>>>>> 3. We allow users to specify a partitioner on Produced hence in
>>>>>> KStream.to and KStream.through, so it makes sense for API consistency.
>>>>>> 
>>>>>> Just my  2 cents.
>>>>>> 
>>>>>> Thanks,
>>>>>> Bill
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Fri, Jul 19, 2019 at 5:46 AM Levani Kokhreidze <
>>>>> levani.co...@gmail.com <mailto:levani.co...@gmail.com>>
>>>>>> wrote:
>>>>>> 
>>>>>>> Hi John,
>>>>>>> 
>>>>>>> In my mind it makes sense.
>>>>>>> If we add partitioner configuration to Repartitioned class, with the
>>>>>>> combination of specifying number of partitions for internal topics, user
>>>>>>> will have opportunity to ensure co-partitioning before join operation.
>>>>>>> I think this can be quite powerful feature.
>>>>>>> Wondering what others think about this?
>>>>>>> 
>>>>>>> Regards,
>>>>>>> Levani
>>>>>>> 
>>>>>>>> On Jul 18, 2019, at 1:20 AM, John Roesler <j...@confluent.io 
>>>>>>>> <mailto:j...@confluent.io>> wrote:
>>>>>>>> 
>>>>>>>> Yes, I believe that's what I had in mind. Again, not totally sure it
>>>>>>>> makes sense, but I believe something similar is the rationale for
>>>>>>>> having the partitioner option in Produced.
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> -John
>>>>>>>> 
>>>>>>>> On Wed, Jul 17, 2019 at 3:20 PM Levani Kokhreidze
>>>>>>>> <levani.co...@gmail.com <mailto:levani.co...@gmail.com>> wrote:
>>>>>>>>> 
>>>>>>>>> Hey John,
>>>>>>>>> 
>>>>>>>>> Oh that’s interesting use-case.
>>>>>>>>> Do I understand this correctly, in your example I would first issue
>>>>>>> repartition(Repartitioned) with proper partitioner that essentially
>>>>> would
>>>>>>> be the same as the topic I want to join with and then do the
>>>>> KStream#join
>>>>>>> with DSL?
>>>>>>>>> 
>>>>>>>>> Regards,
>>>>>>>>> Levani
>>>>>>>>> 
>>>>>>>>>> On Jul 17, 2019, at 11:11 PM, John Roesler <j...@confluent.io 
>>>>>>>>>> <mailto:j...@confluent.io>>
>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>> Hey, all, just to chime in,
>>>>>>>>>> 
>>>>>>>>>> I think it might be useful to have an option to specify the
>>>>>>>>>> partitioner. The case I have in mind is that some data may get
>>>>>>>>>> repartitioned and then joined with an input topic. If the right-side
>>>>>>>>>> input topic uses a custom partitioning strategy, then the
>>>>>>>>>> repartitioned stream also needs to be partitioned with the same
>>>>>>>>>> strategy.
>>>>>>>>>> 
>>>>>>>>>> Does that make sense, or did I maybe miss something important?
>>>>>>>>>> 
>>>>>>>>>> Thanks,
>>>>>>>>>> -John
>>>>>>>>>> 
>>>>>>>>>> On Wed, Jul 17, 2019 at 2:48 PM Levani Kokhreidze
>>>>>>>>>> <levani.co...@gmail.com <mailto:levani.co...@gmail.com>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Yes, I was thinking about it as well. To be honest I’m not sure
>>>>> about
>>>>>>> it yet.
>>>>>>>>>>> As Kafka Streams DSL user, I don’t really think I would need control
>>>>>>> over partitioner for internal topics.
>>>>>>>>>>> As a user, I would assume that Kafka Streams knows best how to
>>>>>>> partition data for internal topics.
>>>>>>>>>>> In this KIP I wrote that Produced should be used only for topics
>>>>> that
>>>>>>> are created by user In advance.
>>>>>>>>>>> In those cases maybe it make sense to have possibility to specify
>>>>> the
>>>>>>> partitioner.
>>>>>>>>>>> I don’t have clear answer on that yet, but I guess specifying the
>>>>>>> partitioner can be added as well if there’s agreement on this.
>>>>>>>>>>> 
>>>>>>>>>>> Regards,
>>>>>>>>>>> Levani
>>>>>>>>>>> 
>>>>>>>>>>>> On Jul 17, 2019, at 10:42 PM, Sophie Blee-Goldman <
>>>>>>> sop...@confluent.io <mailto:sop...@confluent.io>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks for clearing that up. I agree that Repartitioned would be a
>>>>>>> useful
>>>>>>>>>>>> addition. I'm wondering if it might also need to have
>>>>>>>>>>>> a withStreamPartitioner method/field, similar to Produced? I'm not
>>>>>>> sure how
>>>>>>>>>>>> widely this feature is really used, but seems it should be
>>>>> available
>>>>>>> for
>>>>>>>>>>>> repartition topics.
>>>>>>>>>>>> 
>>>>>>>>>>>> On Wed, Jul 17, 2019 at 11:26 AM Levani Kokhreidze <
>>>>>>> levani.co...@gmail.com <mailto:levani.co...@gmail.com>>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Hey Sophie,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> In both cases KStream#repartition and
>>>>>>> KStream#repartition(Repartitioned)
>>>>>>>>>>>>> topic will be created and managed by Kafka Streams.
>>>>>>>>>>>>> Idea of Repartitioned is to give user more control over the topic
>>>>>>> such as
>>>>>>>>>>>>> num of partitions.
>>>>>>>>>>>>> I feel like Repartitioned parameter is something that is missing
>>>>> in
>>>>>>>>>>>>> current DSL design.
>>>>>>>>>>>>> Essentially giving user control over parallelism by configuring
>>>>> num
>>>>>>> of
>>>>>>>>>>>>> partitions for internal topics.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Hope this answers your question.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> Levani
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Jul 17, 2019, at 9:02 PM, Sophie Blee-Goldman <
>>>>>>> sop...@confluent.io <mailto:sop...@confluent.io>>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hey Levani,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks for the KIP! Can you clarify one thing for me -- for the
>>>>>>>>>>>>>> KStream#repartition signature taking a Repartitioned, will the
>>>>>>> topic be
>>>>>>>>>>>>>> auto-created by Streams (which seems to be the case for the
>>>>>>> signature
>>>>>>>>>>>>>> without a Repartitioned) or does it have to be pre-created? The
>>>>>>> wording
>>>>>>>>>>>>> in
>>>>>>>>>>>>>> the KIP makes it seem like one version of the method will
>>>>>>> auto-create
>>>>>>>>>>>>>> topics while the other will not.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>> Sophie
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 10:15 AM Levani Kokhreidze <
>>>>>>>>>>>>> levani.co...@gmail.com <mailto:levani.co...@gmail.com>>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> One more bump about KIP-221 (
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>> 
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>  
>>>>> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint>
>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>> 
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>> )
>>>>>>>>>>>>>>> so it doesn’t get lost in mailing list :)
>>>>>>>>>>>>>>> Would love to hear communities opinions/concerns about this KIP.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Jul 12, 2019, at 5:27 PM, Levani Kokhreidze <
>>>>>>> levani.co...@gmail.com
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Kind reminder about this KIP:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>> 
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>> 
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On Jul 9, 2019, at 11:38 AM, Levani Kokhreidze <
>>>>>>>>>>>>> levani.co...@gmail.com
>>>>>>>>>>>>>>> <mailto:levani.co...@gmail.com>> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> In order to move this KIP forward, I’ve updated confluence
>>>>> page
>>>>>>> with
>>>>>>>>>>>>>>> the new proposal
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>> 
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>> 
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> I’ve also filled “Rejected Alternatives” section.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Looking forward to discuss this KIP :)
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> King regards,
>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On Jul 3, 2019, at 1:08 PM, Levani Kokhreidze <
>>>>>>>>>>>>> levani.co...@gmail.com
>>>>>>>>>>>>>>> <mailto:levani.co...@gmail.com>> wrote:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Hello Matthias,
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Thanks for the feedback and ideas.
>>>>>>>>>>>>>>>>>> I like the idea of introducing dedicated `Topic` class for
>>>>>>> topic
>>>>>>>>>>>>>>> configuration for internal operators like `groupedBy`.
>>>>>>>>>>>>>>>>>> Would be great to hear others opinion about this as well.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> On Jul 3, 2019, at 7:00 AM, Matthias J. Sax <
>>>>>>> matth...@confluent.io
>>>>>>>>>>>>>>> <mailto:matth...@confluent.io>> wrote:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Levani,
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Thanks for picking up this KIP! And thanks for summarizing
>>>>>>>>>>>>> everything.
>>>>>>>>>>>>>>>>>>> Even if some points may have been discussed already (can't
>>>>>>> really
>>>>>>>>>>>>>>>>>>> remember), it's helpful to get a good summary to refresh the
>>>>>>>>>>>>>>> discussion.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> I think your reasoning makes sense. With regard to the
>>>>>>> distinction
>>>>>>>>>>>>>>>>>>> between operators that manage topics and operators that use
>>>>>>>>>>>>>>> user-created
>>>>>>>>>>>>>>>>>>> topics: Following this argument, it might indicate that
>>>>>>> leaving
>>>>>>>>>>>>>>>>>>> `through()` as-is (as an operator that uses use-defined
>>>>>>> topics) and
>>>>>>>>>>>>>>>>>>> introducing a new `repartition()` operator (an operator that
>>>>>>> manages
>>>>>>>>>>>>>>>>>>> topics itself) might be good. Otherwise, there is one
>>>>> operator
>>>>>>>>>>>>>>>>>>> `through()` that sometimes manages topics but sometimes
>>>>> not; a
>>>>>>>>>>>>>>> different
>>>>>>>>>>>>>>>>>>> name, ie, new operator would make the distinction clearer.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> About adding `numOfPartitions` to `Grouped`. I am wondering
>>>>>>> if the
>>>>>>>>>>>>>>> same
>>>>>>>>>>>>>>>>>>> argument as for `Produced` does apply and adding it is
>>>>>>> semantically
>>>>>>>>>>>>>>>>>>> questionable? Might be good to get opinions of others on
>>>>>>> this, too.
>>>>>>>>>>>>> I
>>>>>>>>>>>>>>> am
>>>>>>>>>>>>>>>>>>> not sure myself what solution I prefer atm.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> So far, KS uses configuration objects that allow to
>>>>> configure
>>>>>>> a
>>>>>>>>>>>>>>> certain
>>>>>>>>>>>>>>>>>>> "entity" like a consumer, producer, store. If we assume that
>>>>>>> a topic
>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>> a similar entity, I am wonder if we should have a
>>>>>>>>>>>>>>>>>>> `Topic#withNumberOfPartitions()` class and method instead of
>>>>>>> a plain
>>>>>>>>>>>>>>>>>>> integer? This would allow us to add other configs, like
>>>>>>> replication
>>>>>>>>>>>>>>>>>>> factor, retention-time etc, easily, without the need to
>>>>>>> change the
>>>>>>>>>>>>>>> "main
>>>>>>>>>>>>>>>>>>> API".
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Just want to give some ideas. Not sure if I like them
>>>>> myself.
>>>>>>> :)
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> On 7/1/19 1:04 AM, Levani Kokhreidze wrote:
>>>>>>>>>>>>>>>>>>>> Actually, giving it more though - maybe enhancing Produced
>>>>>>> with num
>>>>>>>>>>>>>>> of partitions configuration is not the best approach. Let me
>>>>>>> explain
>>>>>>>>>>>>> why:
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 1) If we enhance Produced class with this configuration,
>>>>>>> this will
>>>>>>>>>>>>>>> also affect KStream#to operation. Since KStream#to is the final
>>>>>>> sink of
>>>>>>>>>>>>> the
>>>>>>>>>>>>>>> topology, for me, it seems to be reasonable assumption that user
>>>>>>> needs
>>>>>>>>>>>>> to
>>>>>>>>>>>>>>> manually create sink topic in advance. And in that case, having
>>>>>>> num of
>>>>>>>>>>>>>>> partitions configuration doesn’t make much sense.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 2) Looking at Produced class, based on API contract, seems
>>>>>>> like
>>>>>>>>>>>>>>> Produced is designed to be something that is explicitly for
>>>>>>> producer
>>>>>>>>>>>>> (key
>>>>>>>>>>>>>>> serializer, value serializer, partitioner those all are producer
>>>>>>>>>>>>> specific
>>>>>>>>>>>>>>> configurations) and num of partitions is topic level
>>>>>>> configuration. And
>>>>>>>>>>>>> I
>>>>>>>>>>>>>>> don’t think mixing topic and producer level configurations
>>>>>>> together in
>>>>>>>>>>>>> one
>>>>>>>>>>>>>>> class is the good approach.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 3) Looking at KStream interface, seems like Produced
>>>>>>> parameter is
>>>>>>>>>>>>>>> for operations that work with non-internal (e.g topics created
>>>>> and
>>>>>>>>>>>>> managed
>>>>>>>>>>>>>>> internally by Kafka Streams) topics and I think we should leave
>>>>>>> it as
>>>>>>>>>>>>> it is
>>>>>>>>>>>>>>> in that case.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Taking all this things into account, I think we should
>>>>>>> distinguish
>>>>>>>>>>>>>>> between DSL operations, where Kafka Streams should create and
>>>>>>> manage
>>>>>>>>>>>>>>> internal topics (KStream#groupBy) vs topics that should be
>>>>>>> created in
>>>>>>>>>>>>>>> advance (e.g KStream#to).
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> To sum it up, I think adding numPartitions configuration in
>>>>>>>>>>>>> Produced
>>>>>>>>>>>>>>> will result in mixing topic and producer level configuration in
>>>>>>> one
>>>>>>>>>>>>> class
>>>>>>>>>>>>>>> and it’s gonna break existing API semantics.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Regarding making topic name optional in KStream#through - I
>>>>>>> think
>>>>>>>>>>>>>>> underline idea is very useful and giving users possibility to
>>>>>>> specify
>>>>>>>>>>>>> num
>>>>>>>>>>>>>>> of partitions there is even more useful :) Considering arguments
>>>>>>> against
>>>>>>>>>>>>>>> adding num of partitions in Produced class, I see two options
>>>>>>> here:
>>>>>>>>>>>>>>>>>>>> 1) Add following method overloads
>>>>>>>>>>>>>>>>>>>> * through() - topic will be auto-generated and num of
>>>>>>> partitions
>>>>>>>>>>>>>>> will be taken from source topic
>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions) - topic will be auto
>>>>>>>>>>>>>>> generated with specified num of partitions
>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions, final Produced<K, V>
>>>>>>>>>>>>>>> produced) - topic will be with generated with specified num of
>>>>>>>>>>>>> partitions
>>>>>>>>>>>>>>> and configuration taken from produced parameter.
>>>>>>>>>>>>>>>>>>>> 2) Leave KStream#through as it is and introduce new method
>>>>> -
>>>>>>>>>>>>>>> KStream#repartition (I think Matthias suggested this in one of
>>>>> the
>>>>>>>>>>>>> threads)
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Considering all mentioned above I propose the following
>>>>> plan:
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Option A:
>>>>>>>>>>>>>>>>>>>> 1) Leave Produced as it is
>>>>>>>>>>>>>>>>>>>> 2) Add num of partitions configuration to Grouped class (as
>>>>>>>>>>>>>>> mentioned in the KIP)
>>>>>>>>>>>>>>>>>>>> 3) Add following method overloads to KStream#through
>>>>>>>>>>>>>>>>>>>> * through() - topic will be auto-generated and num of
>>>>>>> partitions
>>>>>>>>>>>>>>> will be taken from source topic
>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions) - topic will be auto
>>>>>>>>>>>>>>> generated with specified num of partitions
>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions, final Produced<K, V>
>>>>>>>>>>>>>>> produced) - topic will be with generated with specified num of
>>>>>>>>>>>>> partitions
>>>>>>>>>>>>>>> and configuration taken from produced parameter.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Option B:
>>>>>>>>>>>>>>>>>>>> 1) Leave Produced as it is
>>>>>>>>>>>>>>>>>>>> 2) Add num of partitions configuration to Grouped class (as
>>>>>>>>>>>>>>> mentioned in the KIP)
>>>>>>>>>>>>>>>>>>>> 3) Add new operator KStream#repartition for creating and
>>>>>>> managing
>>>>>>>>>>>>>>> internal repartition topics
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> P.S. I’m sorry if all of this was already discussed in the
>>>>>>> mailing
>>>>>>>>>>>>>>> list, but I kinda got with all the threads that were about this
>>>>>>> KIP :(
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> On Jul 1, 2019, at 9:56 AM, Levani Kokhreidze <
>>>>>>>>>>>>>>> levani.co...@gmail.com <mailto:levani.co...@gmail.com>> wrote:
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> I would like to resurrect discussion around KIP-221. Going
>>>>>>> through
>>>>>>>>>>>>>>> the discussion thread, there’s seems to agreement around
>>>>>>> usefulness of
>>>>>>>>>>>>> this
>>>>>>>>>>>>>>> feature.
>>>>>>>>>>>>>>>>>>>>> Regarding the implementation, as far as I understood, the
>>>>>>> most
>>>>>>>>>>>>>>> optimal solution for me seems the following:
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 1) Add two method overloads to KStream#through method
>>>>>>> (essentially
>>>>>>>>>>>>>>> making topic name optional)
>>>>>>>>>>>>>>>>>>>>> 2) Enhance Produced class with numOfPartitions
>>>>> configuration
>>>>>>>>>>>>> field.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Those two changes will allow DSL users to control
>>>>>>> parallelism and
>>>>>>>>>>>>>>> trigger re-partition without doing stateful operations.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> I will update KIP with interface changes around
>>>>>>> KStream#through if
>>>>>>>>>>>>>>> this changes sound sensible.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>> 
>>>>> 
>>> 
>>> 
>> 
>

Re: [DISCUSS] KIP-221: Repartition Topic Hints in Streams

Reply via email to