Re: [DISCUSS] KIP-221: Repartition Topic Hints in Streams

Levani Kokhreidze Wed, 16 Oct 2019 11:26:23 -0700

Hello all,

Sorry for bringing this thread again, but I would like to get some attention on 
this PR: https://github.com/apache/kafka/pull/7170 
<https://github.com/apache/kafka/pull/7170> 
It's been a while now and I would love to move on to other KIPs as well. Please 
let me know if you have any concerns.


Regards,
Levani


> On Jul 26, 2019, at 11:25 AM, Levani Kokhreidze <[email protected]> 
> wrote:
> 
> Hi all,
> 
> Here’s voting thread for this KIP: 
> https://www.mail-archive.com/[email protected]/msg99680.html 
> <https://www.mail-archive.com/[email protected]/msg99680.html>
> 
> Regards,
> Levani
> 
>> On Jul 24, 2019, at 11:15 PM, Levani Kokhreidze <[email protected] 
>> <mailto:[email protected]>> wrote:
>> 
>> Hi Matthias,
>> 
>> Thanks for the suggestion. I Don’t have strong opinion on that one.
>> Agree that avoiding unnecessary method overloads is a good idea.
>> 
>> Updated KIP
>> 
>> Regards,
>> Levani
>> 
>> 
>>> On Jul 24, 2019, at 8:50 PM, Matthias J. Sax <[email protected] 
>>> <mailto:[email protected]>> wrote:
>>> 
>>> One question:
>>> 
>>> Why do we add
>>> 
>>>> Repartitioned#with(final String name, final int numberOfPartitions)
>>> 
>>> It seems that `#with(String name)`, `#numberOfPartitions(int)` in
>>> combination with `withName()` and `withNumberOfPartitions()` should be
>>> sufficient. Users can chain the method calls.
>>> 
>>> (I think it's valuable to keep the number of overload small if possible.)
>>> 
>>> Otherwise LGTM.
>>> 
>>> 
>>> -Matthias
>>> 
>>> 
>>> On 7/23/19 2:18 PM, Levani Kokhreidze wrote:
>>>> Hello,
>>>> 
>>>> Thanks all for your feedback.
>>>> I started voting procedure for this KIP. If there’re any other concerns 
>>>> about this KIP, please let me know.
>>>> 
>>>> Regards,
>>>> Levani
>>>> 
>>>>> On Jul 20, 2019, at 8:39 PM, Levani Kokhreidze <[email protected] 
>>>>> <mailto:[email protected]>> wrote:
>>>>> 
>>>>> Hi Matthias,
>>>>> 
>>>>> Thanks for the suggestion, makes sense.
>>>>> I’ve updated KIP 
>>>>> (https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>  
>>>>> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint>
>>>>>  
>>>>> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>  
>>>>> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint>>).
>>>>> 
>>>>> Regards,
>>>>> Levani
>>>>> 
>>>>> 
>>>>>> On Jul 20, 2019, at 3:53 AM, Matthias J. Sax <[email protected] 
>>>>>> <mailto:[email protected]> <mailto:[email protected] 
>>>>>> <mailto:[email protected]>>> wrote:
>>>>>> 
>>>>>> Thanks for driving the KIP.
>>>>>> 
>>>>>> I agree that users need to be able to specify a partitioning strategy.
>>>>>> 
>>>>>> Sophie raises a fair point about topic configs and producer configs. My
>>>>>> take is, that consider `Repartitioned` as an "extension" to `Produced`,
>>>>>> that adds topic configuration, is a good way to think about it and helps
>>>>>> to keep the API "clean".
>>>>>> 
>>>>>> 
>>>>>> With regard to method names. I would prefer to avoid abbreviations. Can
>>>>>> we rename:
>>>>>> 
>>>>>> `withNumOfPartitions` -> `withNumberOfPartitions`
>>>>>> 
>>>>>> Furthermore, it might be good to add some more `static` methods:
>>>>>> 
>>>>>> - Repartitioned.with(Serde<K>, Serde<V>)
>>>>>> - Repartitioned.withNumberOfPartitions(int)
>>>>>> - Repartitioned.streamPartitioner(StreamPartitioner)
>>>>>> 
>>>>>> 
>>>>>> -Matthias
>>>>>> 
>>>>>> On 7/19/19 3:33 PM, Levani Kokhreidze wrote:
>>>>>>> Totally agree. I think in KStream interface it makes sense to have some 
>>>>>>> duplicate configurations between operators in order to keep API simple 
>>>>>>> and usable.
>>>>>>> Also, as more surface API has, harder it is to have proper backward 
>>>>>>> compatibility.
>>>>>>> While initial idea of keeping topic level configs separate was 
>>>>>>> exciting, having Repartitioned class encapsulate some producer level 
>>>>>>> configs makes API more readable.
>>>>>>> 
>>>>>>> Regards,
>>>>>>> Levani
>>>>>>> 
>>>>>>>> On Jul 20, 2019, at 1:15 AM, Sophie Blee-Goldman <[email protected] 
>>>>>>>> <mailto:[email protected]> <mailto:[email protected] 
>>>>>>>> <mailto:[email protected]>>> wrote:
>>>>>>>> 
>>>>>>>> I think that is a good point about trying to keep producer level
>>>>>>>> configurations and (repartition) topic level considerations separate.
>>>>>>>> Number of partitions is definitely purely a topic level configuration. 
>>>>>>>> But
>>>>>>>> on some level, serdes and partitioners are just as much a topic
>>>>>>>> configuration as a producer one. You could have two producers 
>>>>>>>> configured
>>>>>>>> with different serdes and/or partitioners, but if they are writing to 
>>>>>>>> the
>>>>>>>> same topic the result would be very difficult to part. So in a sense, 
>>>>>>>> these
>>>>>>>> are configurations of topics in Streams, not just producers.
>>>>>>>> 
>>>>>>>> Another way to think of it: while the Streams API is not always true to
>>>>>>>> this, ideally all the relevant configs for an operator are wrapped 
>>>>>>>> into a
>>>>>>>> single object (in this case, Repartitioned). We could instead split 
>>>>>>>> out the
>>>>>>>> fields in common with Produced into a separate parameter to keep topic 
>>>>>>>> and
>>>>>>>> producer level configurations separate, but this increases the API 
>>>>>>>> surface
>>>>>>>> area by a lot. It's much more straightforward to just say "this is
>>>>>>>> everything that this particular operator needs" without worrying about 
>>>>>>>> what
>>>>>>>> exactly you're specifying.
>>>>>>>> 
>>>>>>>> I suppose you could alternatively make Produced a field of 
>>>>>>>> Repartitioned,
>>>>>>>> but I don't think we do this kind of composition elsewhere in Streams 
>>>>>>>> at
>>>>>>>> the moment
>>>>>>>> 
>>>>>>>> On Fri, Jul 19, 2019 at 1:45 PM Levani Kokhreidze 
>>>>>>>> <[email protected] <mailto:[email protected]> 
>>>>>>>> <mailto:[email protected] <mailto:[email protected]>>>
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> Hi Bill,
>>>>>>>>> 
>>>>>>>>> Thanks a lot for the feedback.
>>>>>>>>> Yes, that makes sense. I’ve updated KIP with 
>>>>>>>>> `Repartitioned#partitioner`
>>>>>>>>> configuration.
>>>>>>>>> In the beginning, I wanted to introduce a class for topic level
>>>>>>>>> configuration and keep topic level and producer level configurations 
>>>>>>>>> (such
>>>>>>>>> as Produced) separately (see my second email in this thread).
>>>>>>>>> But while looking at the semantics of KStream interface, I couldn’t 
>>>>>>>>> really
>>>>>>>>> figure out good operation name for Topic level configuration class 
>>>>>>>>> and just
>>>>>>>>> introducing `Topic` config class was kinda breaking the semantics.
>>>>>>>>> So I think having Repartitioned class which encapsulates topic and
>>>>>>>>> producer level configurations for internal topics is viable thing to 
>>>>>>>>> do.
>>>>>>>>> 
>>>>>>>>> Regards,
>>>>>>>>> Levani
>>>>>>>>> 
>>>>>>>>>> On Jul 19, 2019, at 7:47 PM, Bill Bejeck <[email protected] 
>>>>>>>>>> <mailto:[email protected]> <mailto:[email protected] 
>>>>>>>>>> <mailto:[email protected]>>> wrote:
>>>>>>>>>> 
>>>>>>>>>> Hi Lavani,
>>>>>>>>>> 
>>>>>>>>>> Thanks for resurrecting this KIP.
>>>>>>>>>> 
>>>>>>>>>> I'm also a +1 for adding a partition option.  In addition to the 
>>>>>>>>>> reason
>>>>>>>>>> provided by John, my reasoning is:
>>>>>>>>>> 
>>>>>>>>>> 1. Users may want to use something other than hash-based partitioning
>>>>>>>>>> 2. Users may wish to partition on something different than the key
>>>>>>>>>> without having to change the key.  For example:
>>>>>>>>>>  1. A combination of fields in the value in conjunction with the key
>>>>>>>>>>  2. Something other than the key
>>>>>>>>>> 3. We allow users to specify a partitioner on Produced hence in
>>>>>>>>>> KStream.to and KStream.through, so it makes sense for API 
>>>>>>>>>> consistency.
>>>>>>>>>> 
>>>>>>>>>> Just my  2 cents.
>>>>>>>>>> 
>>>>>>>>>> Thanks,
>>>>>>>>>> Bill
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Fri, Jul 19, 2019 at 5:46 AM Levani Kokhreidze <
>>>>>>>>> [email protected] <mailto:[email protected]> 
>>>>>>>>> <mailto:[email protected] <mailto:[email protected]>>>
>>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Hi John,
>>>>>>>>>>> 
>>>>>>>>>>> In my mind it makes sense.
>>>>>>>>>>> If we add partitioner configuration to Repartitioned class, with the
>>>>>>>>>>> combination of specifying number of partitions for internal topics, 
>>>>>>>>>>> user
>>>>>>>>>>> will have opportunity to ensure co-partitioning before join 
>>>>>>>>>>> operation.
>>>>>>>>>>> I think this can be quite powerful feature.
>>>>>>>>>>> Wondering what others think about this?
>>>>>>>>>>> 
>>>>>>>>>>> Regards,
>>>>>>>>>>> Levani
>>>>>>>>>>> 
>>>>>>>>>>>> On Jul 18, 2019, at 1:20 AM, John Roesler <[email protected] 
>>>>>>>>>>>> <mailto:[email protected]> <mailto:[email protected] 
>>>>>>>>>>>> <mailto:[email protected]>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> Yes, I believe that's what I had in mind. Again, not totally sure 
>>>>>>>>>>>> it
>>>>>>>>>>>> makes sense, but I believe something similar is the rationale for
>>>>>>>>>>>> having the partitioner option in Produced.
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> -John
>>>>>>>>>>>> 
>>>>>>>>>>>> On Wed, Jul 17, 2019 at 3:20 PM Levani Kokhreidze
>>>>>>>>>>>> <[email protected] <mailto:[email protected]> 
>>>>>>>>>>>> <mailto:[email protected] <mailto:[email protected]>>> 
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Hey John,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Oh that’s interesting use-case.
>>>>>>>>>>>>> Do I understand this correctly, in your example I would first 
>>>>>>>>>>>>> issue
>>>>>>>>>>> repartition(Repartitioned) with proper partitioner that essentially
>>>>>>>>> would
>>>>>>>>>>> be the same as the topic I want to join with and then do the
>>>>>>>>> KStream#join
>>>>>>>>>>> with DSL?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> Levani
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Jul 17, 2019, at 11:11 PM, John Roesler <[email protected] 
>>>>>>>>>>>>>> <mailto:[email protected]> <mailto:[email protected] 
>>>>>>>>>>>>>> <mailto:[email protected]>>>
>>>>>>>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hey, all, just to chime in,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I think it might be useful to have an option to specify the
>>>>>>>>>>>>>> partitioner. The case I have in mind is that some data may get
>>>>>>>>>>>>>> repartitioned and then joined with an input topic. If the 
>>>>>>>>>>>>>> right-side
>>>>>>>>>>>>>> input topic uses a custom partitioning strategy, then the
>>>>>>>>>>>>>> repartitioned stream also needs to be partitioned with the same
>>>>>>>>>>>>>> strategy.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Does that make sense, or did I maybe miss something important?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> -John
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 2:48 PM Levani Kokhreidze
>>>>>>>>>>>>>> <[email protected] <mailto:[email protected]> 
>>>>>>>>>>>>>> <mailto:[email protected] <mailto:[email protected]>>> 
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Yes, I was thinking about it as well. To be honest I’m not sure
>>>>>>>>> about
>>>>>>>>>>> it yet.
>>>>>>>>>>>>>>> As Kafka Streams DSL user, I don’t really think I would need 
>>>>>>>>>>>>>>> control
>>>>>>>>>>> over partitioner for internal topics.
>>>>>>>>>>>>>>> As a user, I would assume that Kafka Streams knows best how to
>>>>>>>>>>> partition data for internal topics.
>>>>>>>>>>>>>>> In this KIP I wrote that Produced should be used only for topics
>>>>>>>>> that
>>>>>>>>>>> are created by user In advance.
>>>>>>>>>>>>>>> In those cases maybe it make sense to have possibility to 
>>>>>>>>>>>>>>> specify
>>>>>>>>> the
>>>>>>>>>>> partitioner.
>>>>>>>>>>>>>>> I don’t have clear answer on that yet, but I guess specifying 
>>>>>>>>>>>>>>> the
>>>>>>>>>>> partitioner can be added as well if there’s agreement on this.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Jul 17, 2019, at 10:42 PM, Sophie Blee-Goldman <
>>>>>>>>>>> [email protected] <mailto:[email protected]> 
>>>>>>>>>>> <mailto:[email protected] <mailto:[email protected]>>> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Thanks for clearing that up. I agree that Repartitioned would 
>>>>>>>>>>>>>>>> be a
>>>>>>>>>>> useful
>>>>>>>>>>>>>>>> addition. I'm wondering if it might also need to have
>>>>>>>>>>>>>>>> a withStreamPartitioner method/field, similar to Produced? I'm 
>>>>>>>>>>>>>>>> not
>>>>>>>>>>> sure how
>>>>>>>>>>>>>>>> widely this feature is really used, but seems it should be
>>>>>>>>> available
>>>>>>>>>>> for
>>>>>>>>>>>>>>>> repartition topics.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 11:26 AM Levani Kokhreidze <
>>>>>>>>>>> [email protected] <mailto:[email protected]> 
>>>>>>>>>>> <mailto:[email protected] <mailto:[email protected]>>>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Hey Sophie,
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> In both cases KStream#repartition and
>>>>>>>>>>> KStream#repartition(Repartitioned)
>>>>>>>>>>>>>>>>> topic will be created and managed by Kafka Streams.
>>>>>>>>>>>>>>>>> Idea of Repartitioned is to give user more control over the 
>>>>>>>>>>>>>>>>> topic
>>>>>>>>>>> such as
>>>>>>>>>>>>>>>>> num of partitions.
>>>>>>>>>>>>>>>>> I feel like Repartitioned parameter is something that is 
>>>>>>>>>>>>>>>>> missing
>>>>>>>>> in
>>>>>>>>>>>>>>>>> current DSL design.
>>>>>>>>>>>>>>>>> Essentially giving user control over parallelism by 
>>>>>>>>>>>>>>>>> configuring
>>>>>>>>> num
>>>>>>>>>>> of
>>>>>>>>>>>>>>>>> partitions for internal topics.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Hope this answers your question.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On Jul 17, 2019, at 9:02 PM, Sophie Blee-Goldman <
>>>>>>>>>>> [email protected] <mailto:[email protected]> 
>>>>>>>>>>> <mailto:[email protected] <mailto:[email protected]>>>
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Hey Levani,
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Thanks for the KIP! Can you clarify one thing for me -- for 
>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>> KStream#repartition signature taking a Repartitioned, will 
>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>> topic be
>>>>>>>>>>>>>>>>>> auto-created by Streams (which seems to be the case for the
>>>>>>>>>>> signature
>>>>>>>>>>>>>>>>>> without a Repartitioned) or does it have to be pre-created? 
>>>>>>>>>>>>>>>>>> The
>>>>>>>>>>> wording
>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>> the KIP makes it seem like one version of the method will
>>>>>>>>>>> auto-create
>>>>>>>>>>>>>>>>>> topics while the other will not.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>> Sophie
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 10:15 AM Levani Kokhreidze <
>>>>>>>>>>>>>>>>> [email protected] <mailto:[email protected]> 
>>>>>>>>>>>>>>>>> <mailto:[email protected] 
>>>>>>>>>>>>>>>>> <mailto:[email protected]>>>
>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> One more bump about KIP-221 (
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>  
>>>>>>>>> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint>
>>>>>>>>>  
>>>>>>>>> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>  
>>>>>>>>> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint>>
>>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>  
>>>>>>>>> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint>
>>>>>>>>>>>>>>>>>> )
>>>>>>>>>>>>>>>>>>> so it doesn’t get lost in mailing list :)
>>>>>>>>>>>>>>>>>>> Would love to hear communities opinions/concerns about this 
>>>>>>>>>>>>>>>>>>> KIP.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> On Jul 12, 2019, at 5:27 PM, Levani Kokhreidze <
>>>>>>>>>>> [email protected]
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Kind reminder about this KIP:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> On Jul 9, 2019, at 11:38 AM, Levani Kokhreidze <
>>>>>>>>>>>>>>>>> [email protected]
>>>>>>>>>>>>>>>>>>> <mailto:[email protected]>> wrote:
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> In order to move this KIP forward, I’ve updated confluence
>>>>>>>>> page
>>>>>>>>>>> with
>>>>>>>>>>>>>>>>>>> the new proposal
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> I’ve also filled “Rejected Alternatives” section.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Looking forward to discuss this KIP :)
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> King regards,
>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> On Jul 3, 2019, at 1:08 PM, Levani Kokhreidze <
>>>>>>>>>>>>>>>>> [email protected]
>>>>>>>>>>>>>>>>>>> <mailto:[email protected]>> wrote:
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Hello Matthias,
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Thanks for the feedback and ideas.
>>>>>>>>>>>>>>>>>>>>>> I like the idea of introducing dedicated `Topic` class 
>>>>>>>>>>>>>>>>>>>>>> for
>>>>>>>>>>> topic
>>>>>>>>>>>>>>>>>>> configuration for internal operators like `groupedBy`.
>>>>>>>>>>>>>>>>>>>>>> Would be great to hear others opinion about this as well.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> On Jul 3, 2019, at 7:00 AM, Matthias J. Sax <
>>>>>>>>>>> [email protected]
>>>>>>>>>>>>>>>>>>> <mailto:[email protected]>> wrote:
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Levani,
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Thanks for picking up this KIP! And thanks for 
>>>>>>>>>>>>>>>>>>>>>>> summarizing
>>>>>>>>>>>>>>>>> everything.
>>>>>>>>>>>>>>>>>>>>>>> Even if some points may have been discussed already 
>>>>>>>>>>>>>>>>>>>>>>> (can't
>>>>>>>>>>> really
>>>>>>>>>>>>>>>>>>>>>>> remember), it's helpful to get a good summary to 
>>>>>>>>>>>>>>>>>>>>>>> refresh the
>>>>>>>>>>>>>>>>>>> discussion.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> I think your reasoning makes sense. With regard to the
>>>>>>>>>>> distinction
>>>>>>>>>>>>>>>>>>>>>>> between operators that manage topics and operators that 
>>>>>>>>>>>>>>>>>>>>>>> use
>>>>>>>>>>>>>>>>>>> user-created
>>>>>>>>>>>>>>>>>>>>>>> topics: Following this argument, it might indicate that
>>>>>>>>>>> leaving
>>>>>>>>>>>>>>>>>>>>>>> `through()` as-is (as an operator that uses use-defined
>>>>>>>>>>> topics) and
>>>>>>>>>>>>>>>>>>>>>>> introducing a new `repartition()` operator (an operator 
>>>>>>>>>>>>>>>>>>>>>>> that
>>>>>>>>>>> manages
>>>>>>>>>>>>>>>>>>>>>>> topics itself) might be good. Otherwise, there is one
>>>>>>>>> operator
>>>>>>>>>>>>>>>>>>>>>>> `through()` that sometimes manages topics but sometimes
>>>>>>>>> not; a
>>>>>>>>>>>>>>>>>>> different
>>>>>>>>>>>>>>>>>>>>>>> name, ie, new operator would make the distinction 
>>>>>>>>>>>>>>>>>>>>>>> clearer.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> About adding `numOfPartitions` to `Grouped`. I am 
>>>>>>>>>>>>>>>>>>>>>>> wondering
>>>>>>>>>>> if the
>>>>>>>>>>>>>>>>>>> same
>>>>>>>>>>>>>>>>>>>>>>> argument as for `Produced` does apply and adding it is
>>>>>>>>>>> semantically
>>>>>>>>>>>>>>>>>>>>>>> questionable? Might be good to get opinions of others on
>>>>>>>>>>> this, too.
>>>>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>>>> am
>>>>>>>>>>>>>>>>>>>>>>> not sure myself what solution I prefer atm.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> So far, KS uses configuration objects that allow to
>>>>>>>>> configure
>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>> certain
>>>>>>>>>>>>>>>>>>>>>>> "entity" like a consumer, producer, store. If we assume 
>>>>>>>>>>>>>>>>>>>>>>> that
>>>>>>>>>>> a topic
>>>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>>>> a similar entity, I am wonder if we should have a
>>>>>>>>>>>>>>>>>>>>>>> `Topic#withNumberOfPartitions()` class and method 
>>>>>>>>>>>>>>>>>>>>>>> instead of
>>>>>>>>>>> a plain
>>>>>>>>>>>>>>>>>>>>>>> integer? This would allow us to add other configs, like
>>>>>>>>>>> replication
>>>>>>>>>>>>>>>>>>>>>>> factor, retention-time etc, easily, without the need to
>>>>>>>>>>> change the
>>>>>>>>>>>>>>>>>>> "main
>>>>>>>>>>>>>>>>>>>>>>> API".
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Just want to give some ideas. Not sure if I like them
>>>>>>>>> myself.
>>>>>>>>>>> :)
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> On 7/1/19 1:04 AM, Levani Kokhreidze wrote:
>>>>>>>>>>>>>>>>>>>>>>>> Actually, giving it more though - maybe enhancing 
>>>>>>>>>>>>>>>>>>>>>>>> Produced
>>>>>>>>>>> with num
>>>>>>>>>>>>>>>>>>> of partitions configuration is not the best approach. Let me
>>>>>>>>>>> explain
>>>>>>>>>>>>>>>>> why:
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 1) If we enhance Produced class with this 
>>>>>>>>>>>>>>>>>>>>>>>> configuration,
>>>>>>>>>>> this will
>>>>>>>>>>>>>>>>>>> also affect KStream#to operation. Since KStream#to is the 
>>>>>>>>>>>>>>>>>>> final
>>>>>>>>>>> sink of
>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>> topology, for me, it seems to be reasonable assumption that 
>>>>>>>>>>>>>>>>>>> user
>>>>>>>>>>> needs
>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>> manually create sink topic in advance. And in that case, 
>>>>>>>>>>>>>>>>>>> having
>>>>>>>>>>> num of
>>>>>>>>>>>>>>>>>>> partitions configuration doesn’t make much sense.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 2) Looking at Produced class, based on API contract, 
>>>>>>>>>>>>>>>>>>>>>>>> seems
>>>>>>>>>>> like
>>>>>>>>>>>>>>>>>>> Produced is designed to be something that is explicitly for
>>>>>>>>>>> producer
>>>>>>>>>>>>>>>>> (key
>>>>>>>>>>>>>>>>>>> serializer, value serializer, partitioner those all are 
>>>>>>>>>>>>>>>>>>> producer
>>>>>>>>>>>>>>>>> specific
>>>>>>>>>>>>>>>>>>> configurations) and num of partitions is topic level
>>>>>>>>>>> configuration. And
>>>>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>>>> don’t think mixing topic and producer level configurations
>>>>>>>>>>> together in
>>>>>>>>>>>>>>>>> one
>>>>>>>>>>>>>>>>>>> class is the good approach.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 3) Looking at KStream interface, seems like Produced
>>>>>>>>>>> parameter is
>>>>>>>>>>>>>>>>>>> for operations that work with non-internal (e.g topics 
>>>>>>>>>>>>>>>>>>> created
>>>>>>>>> and
>>>>>>>>>>>>>>>>> managed
>>>>>>>>>>>>>>>>>>> internally by Kafka Streams) topics and I think we should 
>>>>>>>>>>>>>>>>>>> leave
>>>>>>>>>>> it as
>>>>>>>>>>>>>>>>> it is
>>>>>>>>>>>>>>>>>>> in that case.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Taking all this things into account, I think we should
>>>>>>>>>>> distinguish
>>>>>>>>>>>>>>>>>>> between DSL operations, where Kafka Streams should create 
>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>> manage
>>>>>>>>>>>>>>>>>>> internal topics (KStream#groupBy) vs topics that should be
>>>>>>>>>>> created in
>>>>>>>>>>>>>>>>>>> advance (e.g KStream#to).
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> To sum it up, I think adding numPartitions 
>>>>>>>>>>>>>>>>>>>>>>>> configuration in
>>>>>>>>>>>>>>>>> Produced
>>>>>>>>>>>>>>>>>>> will result in mixing topic and producer level 
>>>>>>>>>>>>>>>>>>> configuration in
>>>>>>>>>>> one
>>>>>>>>>>>>>>>>> class
>>>>>>>>>>>>>>>>>>> and it’s gonna break existing API semantics.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Regarding making topic name optional in 
>>>>>>>>>>>>>>>>>>>>>>>> KStream#through - I
>>>>>>>>>>> think
>>>>>>>>>>>>>>>>>>> underline idea is very useful and giving users possibility 
>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>> specify
>>>>>>>>>>>>>>>>> num
>>>>>>>>>>>>>>>>>>> of partitions there is even more useful :) Considering 
>>>>>>>>>>>>>>>>>>> arguments
>>>>>>>>>>> against
>>>>>>>>>>>>>>>>>>> adding num of partitions in Produced class, I see two 
>>>>>>>>>>>>>>>>>>> options
>>>>>>>>>>> here:
>>>>>>>>>>>>>>>>>>>>>>>> 1) Add following method overloads
>>>>>>>>>>>>>>>>>>>>>>>> * through() - topic will be auto-generated and num of
>>>>>>>>>>> partitions
>>>>>>>>>>>>>>>>>>> will be taken from source topic
>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions) - topic will be 
>>>>>>>>>>>>>>>>>>>>>>>> auto
>>>>>>>>>>>>>>>>>>> generated with specified num of partitions
>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions, final Produced<K, 
>>>>>>>>>>>>>>>>>>>>>>>> V>
>>>>>>>>>>>>>>>>>>> produced) - topic will be with generated with specified num 
>>>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>> partitions
>>>>>>>>>>>>>>>>>>> and configuration taken from produced parameter.
>>>>>>>>>>>>>>>>>>>>>>>> 2) Leave KStream#through as it is and introduce new 
>>>>>>>>>>>>>>>>>>>>>>>> method
>>>>>>>>> -
>>>>>>>>>>>>>>>>>>> KStream#repartition (I think Matthias suggested this in one 
>>>>>>>>>>>>>>>>>>> of
>>>>>>>>> the
>>>>>>>>>>>>>>>>> threads)
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Considering all mentioned above I propose the following
>>>>>>>>> plan:
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Option A:
>>>>>>>>>>>>>>>>>>>>>>>> 1) Leave Produced as it is
>>>>>>>>>>>>>>>>>>>>>>>> 2) Add num of partitions configuration to Grouped 
>>>>>>>>>>>>>>>>>>>>>>>> class (as
>>>>>>>>>>>>>>>>>>> mentioned in the KIP)
>>>>>>>>>>>>>>>>>>>>>>>> 3) Add following method overloads to KStream#through
>>>>>>>>>>>>>>>>>>>>>>>> * through() - topic will be auto-generated and num of
>>>>>>>>>>> partitions
>>>>>>>>>>>>>>>>>>> will be taken from source topic
>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions) - topic will be 
>>>>>>>>>>>>>>>>>>>>>>>> auto
>>>>>>>>>>>>>>>>>>> generated with specified num of partitions
>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions, final Produced<K, 
>>>>>>>>>>>>>>>>>>>>>>>> V>
>>>>>>>>>>>>>>>>>>> produced) - topic will be with generated with specified num 
>>>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>> partitions
>>>>>>>>>>>>>>>>>>> and configuration taken from produced parameter.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Option B:
>>>>>>>>>>>>>>>>>>>>>>>> 1) Leave Produced as it is
>>>>>>>>>>>>>>>>>>>>>>>> 2) Add num of partitions configuration to Grouped 
>>>>>>>>>>>>>>>>>>>>>>>> class (as
>>>>>>>>>>>>>>>>>>> mentioned in the KIP)
>>>>>>>>>>>>>>>>>>>>>>>> 3) Add new operator KStream#repartition for creating 
>>>>>>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>> managing
>>>>>>>>>>>>>>>>>>> internal repartition topics
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> P.S. I’m sorry if all of this was already discussed in 
>>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>> mailing
>>>>>>>>>>>>>>>>>>> list, but I kinda got with all the threads that were about 
>>>>>>>>>>>>>>>>>>> this
>>>>>>>>>>> KIP :(
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 1, 2019, at 9:56 AM, Levani Kokhreidze <
>>>>>>>>>>>>>>>>>>> [email protected] <mailto:[email protected]>> 
>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> I would like to resurrect discussion around KIP-221. 
>>>>>>>>>>>>>>>>>>>>>>>>> Going
>>>>>>>>>>> through
>>>>>>>>>>>>>>>>>>> the discussion thread, there’s seems to agreement around
>>>>>>>>>>> usefulness of
>>>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>>> feature.
>>>>>>>>>>>>>>>>>>>>>>>>> Regarding the implementation, as far as I understood, 
>>>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>> most
>>>>>>>>>>>>>>>>>>> optimal solution for me seems the following:
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> 1) Add two method overloads to KStream#through method
>>>>>>>>>>> (essentially
>>>>>>>>>>>>>>>>>>> making topic name optional)
>>>>>>>>>>>>>>>>>>>>>>>>> 2) Enhance Produced class with numOfPartitions
>>>>>>>>> configuration
>>>>>>>>>>>>>>>>> field.
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Those two changes will allow DSL users to control
>>>>>>>>>>> parallelism and
>>>>>>>>>>>>>>>>>>> trigger re-partition without doing stateful operations.
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> I will update KIP with interface changes around
>>>>>>>>>>> KStream#through if
>>>>>>>>>>>>>>>>>>> this changes sound sensible.
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>> 
>

Re: [DISCUSS] KIP-221: Repartition Topic Hints in Streams

Reply via email to