Re: [DISCUSS] KIP-221: Repartition Topic Hints in Streams

Levani Kokhreidze Fri, 26 Jul 2019 01:26:31 -0700

Hi all,

Here’s the voting thread for this KIP:
https://www.mail-archive.com/[email protected]/msg99680.html


Regards,
Levani

> On Jul 24, 2019, at 11:15 PM, Levani Kokhreidze <[email protected]> 
> wrote:
> 
> Hi Matthias,
> 
> Thanks for the suggestion. I don’t have a strong opinion on that one.
> I agree that avoiding unnecessary method overloads is a good idea.
> 
> I’ve updated the KIP.
> 
> Regards,
> Levani
> 
> 
>> On Jul 24, 2019, at 8:50 PM, Matthias J. Sax <[email protected]> wrote:
>> 
>> One question:
>> 
>> Why do we add
>> 
>>> Repartitioned#with(final String name, final int numberOfPartitions)
>> 
>> It seems that `#with(String name)`, `#numberOfPartitions(int)` in
>> combination with `withName()` and `withNumberOfPartitions()` should be
>> sufficient. Users can chain the method calls.
>> 
>> (I think it's valuable to keep the number of overloads small if possible.)
>> 
>> Otherwise LGTM.
>> 
>> 
>> -Matthias
>> 
>> 
>> On 7/23/19 2:18 PM, Levani Kokhreidze wrote:
>>> Hello,
>>> 
>>> Thanks all for your feedback.
>>> I have started the voting procedure for this KIP. If there are any other
>>> concerns about this KIP, please let me know.
>>> 
>>> Regards,
>>> Levani
>>> 
>>>> On Jul 20, 2019, at 8:39 PM, Levani Kokhreidze <[email protected]> 
>>>> wrote:
>>>> 
>>>> Hi Matthias,
>>>> 
>>>> Thanks for the suggestion, makes sense.
>>>> I’ve updated the KIP
>>>> (https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint).
>>>> 
>>>> Regards,
>>>> Levani
>>>> 
>>>> 
>>>>> On Jul 20, 2019, at 3:53 AM, Matthias J. Sax <[email protected] 
>>>>> <mailto:[email protected]>> wrote:
>>>>> 
>>>>> Thanks for driving the KIP.
>>>>> 
>>>>> I agree that users need to be able to specify a partitioning strategy.
>>>>> 
>>>>> Sophie raises a fair point about topic configs and producer configs. My
>>>>> take is that considering `Repartitioned` as an "extension" of `Produced`
>>>>> that adds topic configuration is a good way to think about it and helps
>>>>> to keep the API "clean".
>>>>> 
>>>>> 
>>>>> With regard to method names, I would prefer to avoid abbreviations. Can
>>>>> we rename:
>>>>> 
>>>>> `withNumOfPartitions` -> `withNumberOfPartitions`
>>>>> 
>>>>> Furthermore, it might be good to add some more `static` methods:
>>>>> 
>>>>> - Repartitioned.with(Serde<K>, Serde<V>)
>>>>> - Repartitioned.withNumberOfPartitions(int)
>>>>> - Repartitioned.streamPartitioner(StreamPartitioner)
>>>>> 
>>>>> 
>>>>> -Matthias
>>>>> 
>>>>> On 7/19/19 3:33 PM, Levani Kokhreidze wrote:
>>>>>> Totally agree. I think in the KStream interface it makes sense to have
>>>>>> some duplicate configurations between operators in order to keep the API
>>>>>> simple and usable.
>>>>>> Also, the more surface area an API has, the harder it is to maintain
>>>>>> proper backward compatibility.
>>>>>> While the initial idea of keeping topic level configs separate was
>>>>>> exciting, having the Repartitioned class encapsulate some producer level
>>>>>> configs makes the API more readable.
>>>>>> 
>>>>>> Regards,
>>>>>> Levani
>>>>>> 
>>>>>>> On Jul 20, 2019, at 1:15 AM, Sophie Blee-Goldman <[email protected] 
>>>>>>> <mailto:[email protected]>> wrote:
>>>>>>> 
>>>>>>> I think that is a good point about trying to keep producer level
>>>>>>> configurations and (repartition) topic level considerations separate.
>>>>>>> Number of partitions is definitely purely a topic level configuration.
>>>>>>> But on some level, serdes and partitioners are just as much a topic
>>>>>>> configuration as a producer one. You could have two producers configured
>>>>>>> with different serdes and/or partitioners, but if they are writing to
>>>>>>> the same topic the result would be very difficult to parse. So in a
>>>>>>> sense, these are configurations of topics in Streams, not just producers.
>>>>>>> 
>>>>>>> Another way to think of it: while the Streams API is not always true to
>>>>>>> this, ideally all the relevant configs for an operator are wrapped into
>>>>>>> a single object (in this case, Repartitioned). We could instead split
>>>>>>> out the fields in common with Produced into a separate parameter to keep
>>>>>>> topic and producer level configurations separate, but this increases the
>>>>>>> API surface area by a lot. It's much more straightforward to just say
>>>>>>> "this is everything that this particular operator needs" without
>>>>>>> worrying about what exactly you're specifying.
>>>>>>> 
>>>>>>> I suppose you could alternatively make Produced a field of
>>>>>>> Repartitioned, but I don't think we do this kind of composition
>>>>>>> elsewhere in Streams at the moment.
>>>>>>> 
>>>>>>> On Fri, Jul 19, 2019 at 1:45 PM Levani Kokhreidze 
>>>>>>> <[email protected] <mailto:[email protected]>>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> Hi Bill,
>>>>>>>> 
>>>>>>>> Thanks a lot for the feedback.
>>>>>>>> Yes, that makes sense. I’ve updated the KIP with the
>>>>>>>> `Repartitioned#partitioner` configuration.
>>>>>>>> In the beginning, I wanted to introduce a class for topic level
>>>>>>>> configuration and keep topic level and producer level configurations
>>>>>>>> (such as Produced) separate (see my second email in this thread).
>>>>>>>> But while looking at the semantics of the KStream interface, I couldn’t
>>>>>>>> really figure out a good operation name for a topic level configuration
>>>>>>>> class, and just introducing a `Topic` config class was kinda breaking
>>>>>>>> the semantics.
>>>>>>>> So I think having a Repartitioned class which encapsulates topic and
>>>>>>>> producer level configurations for internal topics is a viable thing to
>>>>>>>> do.
>>>>>>>> 
>>>>>>>> Regards,
>>>>>>>> Levani
>>>>>>>> 
>>>>>>>>> On Jul 19, 2019, at 7:47 PM, Bill Bejeck <[email protected] 
>>>>>>>>> <mailto:[email protected]>> wrote:
>>>>>>>>> 
>>>>>>>>> Hi Levani,
>>>>>>>>> 
>>>>>>>>> Thanks for resurrecting this KIP.
>>>>>>>>> 
>>>>>>>>> I'm also a +1 for adding a partition option. In addition to the
>>>>>>>>> reason provided by John, my reasoning is:
>>>>>>>>> 
>>>>>>>>> 1. Users may want to use something other than hash-based partitioning
>>>>>>>>> 2. Users may wish to partition on something different than the key
>>>>>>>>> without having to change the key.  For example:
>>>>>>>>>  1. A combination of fields in the value in conjunction with the key
>>>>>>>>>  2. Something other than the key
>>>>>>>>> 3. We allow users to specify a partitioner on Produced, and hence in
>>>>>>>>> KStream.to and KStream.through, so it makes sense for API consistency.
>>>>>>>>> 
>>>>>>>>> Just my 2 cents.
>>>>>>>>> 
>>>>>>>>> Thanks,
>>>>>>>>> Bill
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Fri, Jul 19, 2019 at 5:46 AM Levani Kokhreidze <
>>>>>>>> [email protected] <mailto:[email protected]>>
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> Hi John,
>>>>>>>>>> 
>>>>>>>>>> In my mind it makes sense.
>>>>>>>>>> If we add a partitioner configuration to the Repartitioned class, then
>>>>>>>>>> in combination with specifying the number of partitions for internal
>>>>>>>>>> topics, the user will have the opportunity to ensure co-partitioning
>>>>>>>>>> before a join operation.
>>>>>>>>>> I think this can be quite a powerful feature.
>>>>>>>>>> I wonder what others think about this?
>>>>>>>>>> 
>>>>>>>>>> Regards,
>>>>>>>>>> Levani
>>>>>>>>>> 
>>>>>>>>>>> On Jul 18, 2019, at 1:20 AM, John Roesler <[email protected] 
>>>>>>>>>>> <mailto:[email protected]>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Yes, I believe that's what I had in mind. Again, not totally sure it
>>>>>>>>>>> makes sense, but I believe something similar is the rationale for
>>>>>>>>>>> having the partitioner option in Produced.
>>>>>>>>>>> 
>>>>>>>>>>> Thanks,
>>>>>>>>>>> -John
>>>>>>>>>>> 
>>>>>>>>>>> On Wed, Jul 17, 2019 at 3:20 PM Levani Kokhreidze
>>>>>>>>>>> <[email protected] <mailto:[email protected]>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> Hey John,
>>>>>>>>>>>> 
>>>>>>>>>>>> Oh, that’s an interesting use-case.
>>>>>>>>>>>> Do I understand this correctly: in your example I would first issue
>>>>>>>>>>>> repartition(Repartitioned) with a proper partitioner that essentially
>>>>>>>>>>>> would be the same as that of the topic I want to join with, and then
>>>>>>>>>>>> do the KStream#join with the DSL?
>>>>>>>>>>>> 
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Levani
>>>>>>>>>>>> 
>>>>>>>>>>>>> On Jul 17, 2019, at 11:11 PM, John Roesler <[email protected] 
>>>>>>>>>>>>> <mailto:[email protected]>>
>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Hey, all, just to chime in,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I think it might be useful to have an option to specify the
>>>>>>>>>>>>> partitioner. The case I have in mind is that some data may get
>>>>>>>>>>>>> repartitioned and then joined with an input topic. If the
>>>>>>>>>>>>> right-side input topic uses a custom partitioning strategy, then
>>>>>>>>>>>>> the repartitioned stream also needs to be partitioned with the
>>>>>>>>>>>>> same strategy.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Does that make sense, or did I maybe miss something important?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> -John
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 2:48 PM Levani Kokhreidze
>>>>>>>>>>>>> <[email protected] <mailto:[email protected]>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Yes, I was thinking about it as well. To be honest I’m not sure
>>>>>>>>>>>>>> about it yet.
>>>>>>>>>>>>>> As a Kafka Streams DSL user, I don’t really think I would need
>>>>>>>>>>>>>> control over the partitioner for internal topics.
>>>>>>>>>>>>>> As a user, I would assume that Kafka Streams knows best how to
>>>>>>>>>>>>>> partition data for internal topics.
>>>>>>>>>>>>>> In this KIP I wrote that Produced should be used only for topics
>>>>>>>>>>>>>> that are created by the user in advance.
>>>>>>>>>>>>>> In those cases maybe it makes sense to have the possibility to
>>>>>>>>>>>>>> specify the partitioner.
>>>>>>>>>>>>>> I don’t have a clear answer on that yet, but I guess specifying the
>>>>>>>>>>>>>> partitioner can be added as well if there’s agreement on this.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Jul 17, 2019, at 10:42 PM, Sophie Blee-Goldman <
>>>>>>>>>> [email protected] <mailto:[email protected]>> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Thanks for clearing that up. I agree that Repartitioned would be a
>>>>>>>>>>>>>>> useful addition. I'm wondering if it might also need to have
>>>>>>>>>>>>>>> a withStreamPartitioner method/field, similar to Produced? I'm not
>>>>>>>>>>>>>>> sure how widely this feature is really used, but it seems it should
>>>>>>>>>>>>>>> be available for repartition topics.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 11:26 AM Levani Kokhreidze <
>>>>>>>>>> [email protected] <mailto:[email protected]>>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Hey Sophie,
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> In both cases, KStream#repartition and
>>>>>>>>>>>>>>>> KStream#repartition(Repartitioned), the topic will be created and
>>>>>>>>>>>>>>>> managed by Kafka Streams.
>>>>>>>>>>>>>>>> The idea of Repartitioned is to give the user more control over the
>>>>>>>>>>>>>>>> topic, such as the number of partitions.
>>>>>>>>>>>>>>>> I feel like the Repartitioned parameter is something that is missing
>>>>>>>>>>>>>>>> in the current DSL design.
>>>>>>>>>>>>>>>> Essentially, it gives the user control over parallelism by
>>>>>>>>>>>>>>>> configuring the number of partitions for internal topics.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Hope this answers your question.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On Jul 17, 2019, at 9:02 PM, Sophie Blee-Goldman <
>>>>>>>>>> [email protected] <mailto:[email protected]>>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Hey Levani,
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Thanks for the KIP! Can you clarify one thing for me -- for the
>>>>>>>>>>>>>>>>> KStream#repartition signature taking a Repartitioned, will the
>>>>>>>>>>>>>>>>> topic be auto-created by Streams (which seems to be the case for
>>>>>>>>>>>>>>>>> the signature without a Repartitioned) or does it have to be
>>>>>>>>>>>>>>>>> pre-created? The wording in the KIP makes it seem like one version
>>>>>>>>>>>>>>>>> of the method will auto-create topics while the other will not.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>> Sophie
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 10:15 AM Levani Kokhreidze <
>>>>>>>>>>>>>>>> [email protected] <mailto:[email protected]>>
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> One more bump about KIP-221
>>>>>>>>>>>>>>>>>> (https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint)
>>>>>>>>>>>>>>>>>> so it doesn’t get lost in the mailing list :)
>>>>>>>>>>>>>>>>>> I would love to hear the community’s opinions/concerns about this
>>>>>>>>>>>>>>>>>> KIP.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> On Jul 12, 2019, at 5:27 PM, Levani Kokhreidze <
>>>>>>>>>> [email protected]
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Kind reminder about this KIP:
>>>>>>>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> On Jul 9, 2019, at 11:38 AM, Levani Kokhreidze <
>>>>>>>>>>>>>>>> [email protected]
>>>>>>>>>>>>>>>>>> <mailto:[email protected]>> wrote:
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> In order to move this KIP forward, I’ve updated the confluence
>>>>>>>>>>>>>>>>>>>> page with the new proposal:
>>>>>>>>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> I’ve also filled in the “Rejected Alternatives” section.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Looking forward to discussing this KIP :)
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> On Jul 3, 2019, at 1:08 PM, Levani Kokhreidze <
>>>>>>>>>>>>>>>> [email protected]
>>>>>>>>>>>>>>>>>> <mailto:[email protected]>> wrote:
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Hello Matthias,
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Thanks for the feedback and ideas.
>>>>>>>>>>>>>>>>>>>>> I like the idea of introducing a dedicated `Topic` class for
>>>>>>>>>>>>>>>>>>>>> topic configuration for internal operators like `groupedBy`.
>>>>>>>>>>>>>>>>>>>>> It would be great to hear others’ opinions about this as well.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> On Jul 3, 2019, at 7:00 AM, Matthias J. Sax <
>>>>>>>>>> [email protected]
>>>>>>>>>>>>>>>>>> <mailto:[email protected]>> wrote:
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Levani,
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Thanks for picking up this KIP! And thanks for summarizing
>>>>>>>>>>>>>>>>>>>>>> everything. Even if some points may have been discussed already
>>>>>>>>>>>>>>>>>>>>>> (can't really remember), it's helpful to get a good summary to
>>>>>>>>>>>>>>>>>>>>>> refresh the discussion.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> I think your reasoning makes sense. With regard to the
>>>>>>>>>>>>>>>>>>>>>> distinction between operators that manage topics and operators
>>>>>>>>>>>>>>>>>>>>>> that use user-created topics: Following this argument, it might
>>>>>>>>>>>>>>>>>>>>>> indicate that leaving `through()` as-is (as an operator that
>>>>>>>>>>>>>>>>>>>>>> uses user-defined topics) and introducing a new `repartition()`
>>>>>>>>>>>>>>>>>>>>>> operator (an operator that manages topics itself) might be good.
>>>>>>>>>>>>>>>>>>>>>> Otherwise, there is one operator `through()` that sometimes
>>>>>>>>>>>>>>>>>>>>>> manages topics but sometimes not; a different name, i.e., a new
>>>>>>>>>>>>>>>>>>>>>> operator, would make the distinction clearer.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> About adding `numOfPartitions` to `Grouped`: I am wondering if
>>>>>>>>>>>>>>>>>>>>>> the same argument as for `Produced` applies and adding it is
>>>>>>>>>>>>>>>>>>>>>> semantically questionable? Might be good to get opinions of
>>>>>>>>>>>>>>>>>>>>>> others on this, too. I am not sure myself what solution I prefer
>>>>>>>>>>>>>>>>>>>>>> atm.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> So far, KS uses configuration objects that allow us to configure
>>>>>>>>>>>>>>>>>>>>>> a certain "entity" like a consumer, producer, store. If we
>>>>>>>>>>>>>>>>>>>>>> assume that a topic is a similar entity, I am wondering if we
>>>>>>>>>>>>>>>>>>>>>> should have a `Topic#withNumberOfPartitions()` class and method
>>>>>>>>>>>>>>>>>>>>>> instead of a plain integer? This would allow us to add other
>>>>>>>>>>>>>>>>>>>>>> configs, like replication factor, retention-time etc, easily,
>>>>>>>>>>>>>>>>>>>>>> without the need to change the "main API".
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Just want to give some ideas. Not sure if I like them myself. :)
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> On 7/1/19 1:04 AM, Levani Kokhreidze wrote:
>>>>>>>>>>>>>>>>>>>>>>> Actually, giving it more thought - maybe enhancing Produced
>>>>>>>>>>>>>>>>>>>>>>> with a num of partitions configuration is not the best
>>>>>>>>>>>>>>>>>>>>>>> approach. Let me explain why:
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 1) If we enhance the Produced class with this configuration,
>>>>>>>>>>>>>>>>>>>>>>> this will also affect the KStream#to operation. Since
>>>>>>>>>>>>>>>>>>>>>>> KStream#to is the final sink of the topology, for me, it seems
>>>>>>>>>>>>>>>>>>>>>>> to be a reasonable assumption that the user needs to manually
>>>>>>>>>>>>>>>>>>>>>>> create the sink topic in advance. And in that case, having a
>>>>>>>>>>>>>>>>>>>>>>> num of partitions configuration doesn’t make much sense.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 2) Looking at the Produced class, based on the API contract, it
>>>>>>>>>>>>>>>>>>>>>>> seems like Produced is designed to be something that is
>>>>>>>>>>>>>>>>>>>>>>> explicitly for the producer (key serializer, value serializer,
>>>>>>>>>>>>>>>>>>>>>>> partitioner: those are all producer specific configurations)
>>>>>>>>>>>>>>>>>>>>>>> and num of partitions is a topic level configuration. And I
>>>>>>>>>>>>>>>>>>>>>>> don’t think mixing topic and producer level configurations
>>>>>>>>>>>>>>>>>>>>>>> together in one class is a good approach.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 3) Looking at the KStream interface, it seems like the Produced
>>>>>>>>>>>>>>>>>>>>>>> parameter is for operations that work with non-internal topics
>>>>>>>>>>>>>>>>>>>>>>> (i.e. topics that are not created and managed internally by
>>>>>>>>>>>>>>>>>>>>>>> Kafka Streams) and I think we should leave it as it is in that
>>>>>>>>>>>>>>>>>>>>>>> case.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Taking all these things into account, I think we should
>>>>>>>>>>>>>>>>>>>>>>> distinguish between DSL operations where Kafka Streams should
>>>>>>>>>>>>>>>>>>>>>>> create and manage internal topics (KStream#groupBy) vs topics
>>>>>>>>>>>>>>>>>>>>>>> that should be created in advance (e.g. KStream#to).
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> To sum it up, I think adding a numPartitions configuration to
>>>>>>>>>>>>>>>>>>>>>>> Produced will result in mixing topic and producer level
>>>>>>>>>>>>>>>>>>>>>>> configuration in one class and it’s gonna break existing API
>>>>>>>>>>>>>>>>>>>>>>> semantics.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Regarding making the topic name optional in KStream#through - I
>>>>>>>>>>>>>>>>>>>>>>> think the underlying idea is very useful and giving users the
>>>>>>>>>>>>>>>>>>>>>>> possibility to specify num of partitions there is even more
>>>>>>>>>>>>>>>>>>>>>>> useful :) Considering the arguments against adding num of
>>>>>>>>>>>>>>>>>>>>>>> partitions to the Produced class, I see two options here:
>>>>>>>>>>>>>>>>>>>>>>> 1) Add the following method overloads
>>>>>>>>>>>>>>>>>>>>>>> * through() - topic will be auto-generated and num of
>>>>>>>>>>>>>>>>>>>>>>> partitions will be taken from the source topic
>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions) - topic will be
>>>>>>>>>>>>>>>>>>>>>>> auto-generated with the specified num of partitions
>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions, final Produced<K, V>
>>>>>>>>>>>>>>>>>>>>>>> produced) - topic will be generated with the specified num of
>>>>>>>>>>>>>>>>>>>>>>> partitions and configuration taken from the produced parameter.
>>>>>>>>>>>>>>>>>>>>>>> 2) Leave KStream#through as it is and introduce a new method -
>>>>>>>>>>>>>>>>>>>>>>> KStream#repartition (I think Matthias suggested this in one of
>>>>>>>>>>>>>>>>>>>>>>> the threads)
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Considering everything mentioned above, I propose the following
>>>>>>>>>>>>>>>>>>>>>>> plan:
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Option A:
>>>>>>>>>>>>>>>>>>>>>>> 1) Leave Produced as it is
>>>>>>>>>>>>>>>>>>>>>>> 2) Add num of partitions configuration to the Grouped class (as
>>>>>>>>>>>>>>>>>>>>>>> mentioned in the KIP)
>>>>>>>>>>>>>>>>>>>>>>> 3) Add the following method overloads to KStream#through
>>>>>>>>>>>>>>>>>>>>>>> * through() - topic will be auto-generated and num of
>>>>>>>>>>>>>>>>>>>>>>> partitions will be taken from the source topic
>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions) - topic will be
>>>>>>>>>>>>>>>>>>>>>>> auto-generated with the specified num of partitions
>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions, final Produced<K, V>
>>>>>>>>>>>>>>>>>>>>>>> produced) - topic will be generated with the specified num of
>>>>>>>>>>>>>>>>>>>>>>> partitions and configuration taken from the produced parameter.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Option B:
>>>>>>>>>>>>>>>>>>>>>>> 1) Leave Produced as it is
>>>>>>>>>>>>>>>>>>>>>>> 2) Add num of partitions configuration to the Grouped class (as
>>>>>>>>>>>>>>>>>>>>>>> mentioned in the KIP)
>>>>>>>>>>>>>>>>>>>>>>> 3) Add a new operator KStream#repartition for creating and
>>>>>>>>>>>>>>>>>>>>>>> managing internal repartition topics
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> P.S. I’m sorry if all of this was already discussed in the
>>>>>>>>>>>>>>>>>>>>>>> mailing list, but I kinda got lost with all the threads that
>>>>>>>>>>>>>>>>>>>>>>> were about this KIP :(
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> On Jul 1, 2019, at 9:56 AM, Levani Kokhreidze <
>>>>>>>>>>>>>>>>>> [email protected] <mailto:[email protected]>> 
>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> I would like to resurrect the discussion around KIP-221. Going
>>>>>>>>>>>>>>>>>>>>>>>> through the discussion thread, there seems to be agreement
>>>>>>>>>>>>>>>>>>>>>>>> around the usefulness of this feature.
>>>>>>>>>>>>>>>>>>>>>>>> Regarding the implementation, as far as I understood, the most
>>>>>>>>>>>>>>>>>>>>>>>> optimal solution for me seems the following:
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 1) Add two method overloads to the KStream#through method
>>>>>>>>>>>>>>>>>>>>>>>> (essentially making the topic name optional)
>>>>>>>>>>>>>>>>>>>>>>>> 2) Enhance the Produced class with a numOfPartitions
>>>>>>>>>>>>>>>>>>>>>>>> configuration field.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Those two changes will allow DSL users to control parallelism
>>>>>>>>>>>>>>>>>>>>>>>> and trigger re-partition without doing stateful operations.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> I will update the KIP with interface changes around
>>>>>>>>>>>>>>>>>>>>>>>> KStream#through if these changes sound sensible.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>>>>>>>>>> Levani
>
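
For readers following the quoted discussion, here is a minimal sketch of the
API shape Matthias suggests (static entry points plus chained `with...`
methods, and no combined `with(name, numberOfPartitions)` overload), together
with John's co-partitioning point. It is plain Java with no Kafka dependency.
`RepartitionedSketch`, its nested `Partitioner` interface and `partitionFor`
are hypothetical names made up for illustration; they are assumptions, not
the Kafka Streams classes or the final KIP-221 API.

```java
import java.util.Objects;

/**
 * Hypothetical stand-in for the Repartitioned config object discussed in this
 * thread. This is NOT the Kafka Streams class; every name is an assumption.
 */
final class RepartitionedSketch<K> {

    /** Hypothetical partitioner hook: maps a key to a partition in [0, numPartitions). */
    interface Partitioner<K> {
        int partition(K key, int numPartitions);
    }

    private final String name;                // null: let the runtime generate a name
    private final Integer numberOfPartitions; // null: inherit the upstream partition count
    private final Partitioner<K> partitioner; // null: fall back to hash partitioning

    private RepartitionedSketch(String name, Integer numberOfPartitions, Partitioner<K> partitioner) {
        this.name = name;
        this.numberOfPartitions = numberOfPartitions;
        this.partitioner = partitioner;
    }

    // Static entry points, one per option; no with(name, numberOfPartitions) overload.
    static <K> RepartitionedSketch<K> with(String name) {
        return new RepartitionedSketch<>(Objects.requireNonNull(name), null, null);
    }

    static <K> RepartitionedSketch<K> numberOfPartitions(int numberOfPartitions) {
        return new RepartitionedSketch<>(null, requirePositive(numberOfPartitions), null);
    }

    static <K> RepartitionedSketch<K> streamPartitioner(Partitioner<K> partitioner) {
        return new RepartitionedSketch<>(null, null, Objects.requireNonNull(partitioner));
    }

    // Chained "with" methods; each returns a new immutable copy.
    RepartitionedSketch<K> withName(String name) {
        return new RepartitionedSketch<>(Objects.requireNonNull(name), numberOfPartitions, partitioner);
    }

    RepartitionedSketch<K> withNumberOfPartitions(int numberOfPartitions) {
        return new RepartitionedSketch<>(name, requirePositive(numberOfPartitions), partitioner);
    }

    RepartitionedSketch<K> withStreamPartitioner(Partitioner<K> partitioner) {
        return new RepartitionedSketch<>(name, numberOfPartitions, Objects.requireNonNull(partitioner));
    }

    String name() {
        return name;
    }

    /** Partition a key would land in; upstreamPartitions is used when no count was configured. */
    int partitionFor(K key, int upstreamPartitions) {
        int n = numberOfPartitions != null ? numberOfPartitions : upstreamPartitions;
        return partitioner != null
                ? partitioner.partition(key, n)
                : Math.floorMod(Objects.hashCode(key), n);
    }

    private static int requirePositive(int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("numberOfPartitions must be positive");
        }
        return n;
    }

    public static void main(String[] args) {
        // Custom (non-hash) strategy, chosen only to make the example easy to follow.
        Partitioner<String> byLength = (key, n) -> Math.floorMod(key.length(), n);

        RepartitionedSketch<String> left = RepartitionedSketch.<String>with("left-side")
                .withNumberOfPartitions(4)
                .withStreamPartitioner(byLength);
        RepartitionedSketch<String> right = RepartitionedSketch.<String>numberOfPartitions(4)
                .withStreamPartitioner(byLength);

        // Same partitioner and same partition count: both sides pick the same partition.
        System.out.println(left.partitionFor("customer-42", 1) == right.partitionFor("customer-42", 8));
    }
}
```

The point of the sketch is only that chaining covers every combination
without extra overloads, and that two streams configured with the same
partitioner and the same partition count send a given key to the same
partition number, which is the co-partitioning precondition for a join.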
