Re: [DISCUSS] KIP-221: Repartition Topic Hints in Streams

Matthias J. Sax Sat, 16 Nov 2019 13:04:13 -0800
Thanks a lot Levani!

On 11/16/19 4:00 AM, Levani Kokhreidze wrote:
> Matthias,
> 
> Yes, I agree. The KIP is updated:
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+DSL+with+Connecting+Topic+Creation+and+Repartition+Hint
> and the follow-up JIRA ticket is linked in the “Rejected Alternatives” section.
> 
> Thank you all for an interesting discussion.
> 
> Kind Regards,
> Levani
> 
>> On Nov 16, 2019, at 10:11 AM, Matthias J. Sax <[email protected]> wrote:
>>
>> Levani,
>>
>> do you agree with the current proposal? It's basically a de-scoping of the
>> already voted KIP. If you agree, could you update the KIP wiki page
>> accordingly, including the "Rejected Alternatives" section (and maybe a
>> link to a follow-up Jira ticket).
>>
>> Because it's a descope, and John and I support it, there seems to
>> be no need to re-vote.
>>
>> @Sophie,John: thanks a lot for your thoughtful input!
>>
>>
>> -Matthias
>>
>> On 11/15/19 12:47 PM, John Roesler wrote:
>>> Thanks Sophie,
>>>
>>> I think your concern is valid, and also that your idea to make a
>>> ticket is a good idea.
>>>
>>> Creating a ticket has some very positive effects:
>>> * It allows us to record the thinking at this point in time so we
>>> don't have to dig through the mail archives later
>>> * It demonstrates that we did consider the use case, and do want to
>>> address it, but just don't feel confident to implement it right now.
>>> Then, if/when people do have a problem with the gap, the ticket is
>>> already there for them to consider, request, or even pick up.
>>>
>>> Since one aspect of the deferral is a desire to wait for real use
>>> experience, we should explicitly mention that in the ticket. This is
>>> just good information for people browsing the Jira looking for
>>> interesting tickets to pick up. They could still pick it up, but they
>>> can ask themselves if they really understand the real-world use cases
>>> any better than we do right now.
>>>
>>> Thanks, likewise, to you for the good discussion!
>>> -John
>>>
>>> On Fri, Nov 15, 2019 at 2:37 PM Sophie Blee-Goldman <[email protected]> 
>>> wrote:
>>>>
>>>> While I'm concerned that "not augmenting groupBy as part of this KIP"
>>>> really translates to "will not get around to augmenting groupBy for a long
>>>> time if not as part of this KIP", like I said I don't want to hold up the
>>>> new .repartition operator that it seems we do, at least, all agree on. It's
>>>> a fair point that we can always add this in later, but undoing it is far
>>>> more problematic.
>>>>
>>>> Anyways, I would be happy if we at least make a ticket to consider adding a
>>>> "number of partitions" option/suggestion to groupBy, so that we don't lose
>>>> all the thought put into this decision so far, can avoid rehashing the same
>>>> argument word for word, and have something to point to when someone
>>>> asks "why didn't we add this numPartitions option to groupBy".
>>>>
>>>> Beyond that, if the community isn't pushing for it at this moment then it
>>>> seems very reasonable to shelve the idea for now so that the rest of this
>>>> KIP can proceed. Without input one way or another it's hard to say what the
>>>> right thing to do is, which makes the right thing to do "wait to add this
>>>> feature".
>>>> Thanks for the good discussion everyone,
>>>>
>>>> Sophie
>>>>
>>>> On Fri, Nov 15, 2019 at 12:41 PM John Roesler <[email protected]> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I think that Sophie is asking a good question, and I do think that
>>>>> such "blanket configurations" are plausible. For example, we currently
>>>>> support (and I would encourage) "I don't know if this is going to
>>>>> create a repartition topic, but if it does, then use this name instead
>>>>> of generating one".
>>>>>
>>>>> I'm not sure I'm convinced that specifying max parallelism falls into
>>>>> this category. After all, the groupByKey+aggregate will be executed
>>>>> with _some_ max parallelism. It's either the same as the inputs'
>>>>> partition count or overridden with the proposed config. It seems
>>>>> counterintuitive to override the specified option with the default
>>>>> value.
>>>>>
>>>>> I'm not sure if I can put my finger on it, but "maybe use this name"
>>>>> seems way more reasonable to me than "maybe execute with this degree
>>>>> of parallelism".
>>>>>
>>>>> I do think (and I appreciate that this is where Sophie's example is
>>>>> coming from) that Streams should strive to be absolutely as simple and
>>>>> intuitive as possible (while still maintaining correctness). Optimal
>>>>> performance can be at odds with API simplicity. For example, the
>>>>> simplest behavior is, if you ask for 5 partitions, you get 5
>>>>> partitions. Maybe a repartition is technically not necessary (if you
>>>>> didn't change the key), but at least there's no mystery to this
>>>>> behavior.
>>>>>
>>>>> Clearly, an (opposing) tenet of simplicity is trying to prevent
>>>>> people from making mistakes, which I think is what the example boils
>>>>> down to. Sometimes, we can prevent clear mistakes, like equi-joining
>>>>> two topics with different partition counts. But for this case, it
>>>>> doesn't seem as clear-cut to be able to assume that they _said_ 5
>>>>> partitions, but they didn't really _want_ 5 partitions. Maybe we can
>>>>> just try to be clear in the documentation, and also even log a warning
>>>>> when we parse the topology, "hey, I've been asked to repartition this
>>>>> stream, but it's not necessary".
>>>>>
>>>>> If anything, this discussion really supports to me the value in just
>>>>> sticking with `repartition()` for now, and deferring
>>>>> `groupBy[Key](partitions)` to the future.
>>>>>
>>>>>> Users should not have to choose between allowing Streams to optimize the
>>>>>> repartition placement, and allowing to specify a number of partitions.
>>>>>
>>>>> This is a very fair point, and it may be something that we rapidly
>>>>> return to, but it seems safe for now to introduce the non-optimizable
>>>>> `repartition()` only, and then consider optimization options later.
>>>>> Skipping available optimizations will never break correctness, but
>>>>> adding optimizations can, so it makes sense to treat them with
>>>>> caution.
>>>>>
>>>>> In conclusion, I do think that a user _could_ want to "maybe specify"
>>>>> the partition count, but I also think we can afford to pass on
>>>>> supporting this right now.
>>>>>
>>>>> I'm open to continuing the discussion, but just to avoid ambiguity, I
>>>>> still feel we should _not_ change the groupBy[Key] operation at all,
>>>>> and we should only add `repartition()` as a non-optimizable operation.
>>>>>
>>>>> Thanks all,
>>>>> -John
>>>>>
>>>>> On Fri, Nov 15, 2019 at 11:26 AM Levani Kokhreidze
>>>>> <[email protected]> wrote:
>>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> Just FYI, the PR was updated and now incorporates the latest suggestions about joins.
>>>>>> `CopartitionedTopicsEnforcer` will throw an exception if the numbers of partitions aren’t the same when using the `repartition` operation along with `join`.
>>>>>>
>>>>>> For more details please take a look at the PR: https://github.com/apache/kafka/pull/7170/files
>>>>>>
>>>>>> Regards,
>>>>>> Levani
>>>>>>
>>>>>>
>>>>>>> On Nov 15, 2019, at 11:01 AM, Matthias J. Sax <[email protected]> wrote:
>>>>>>>
>>>>>>> Thanks a lot for the input Sophie.
>>>>>>>
>>>>>>> Your example is quite useful, and I would use it to support my claim
>>>>>>> that a "partition hint" for `Grouped` seems "useless" and does not
>>>>>>> improve the user experience.
>>>>>>>
>>>>>>> 1) You argue that a new user would be worried about repartition topics
>>>>>>> with too many partitions. This would imply that a user is already
>>>>>>> advanced enough to understand the implications of repartitioning -- for
>>>>>>> this case, I would argue that a user also understands _when_ an
>>>>>>> auto-repartitioning would happen and thus understands where to
>>>>>>> insert a `repartition()` operation.
>>>>>>>
>>>>>>> 2) For specifying Serdes: if a `groupByKey()` does not trigger
>>>>>>> auto-repartitioning it's not required to specify the serdes and if they
>>>>>>> are specified they would be ignored/unused (note that `groupBy()` would
>>>>>>> always trigger a repartitioning). Of course, if the default Serdes from
>>>>>>> the config match (eg, all data types are Json anyway), a user does not
>>>>>>> need to worry about specifying serdes. -- For new users that play around,
>>>>>>> I would assume that they work a lot with primitive types and thus would
>>>>>>> need to specify the serdes -- hence, they would learn about
>>>>>>> auto-repartitioning the hard way anyhow, because each time a
>>>>>>> `groupByKey()` does trigger auto-repartitioning, they would need to pass
>>>>>>> in the correct Serdes -- this way, they would also be educated where to
>>>>>>> insert a `repartition()` operator if needed.
>>>>>>>
>>>>>>> 3) If a new user really just "plays around", I don't think they use an
>>>>>>> input topic with 100 partitions but most likely have a local single-node
>>>>>>> broker with most likely single-partition topics.
>>>>>>>
>>>>>>>
>>>>>>> My main argument for my current proposal is however, that---based on
>>>>>>> past experience---it's better to roll out a new feature more carefully
>>>>>>> and see how it goes. Last, as John pointed out, we can still extend the
>>>>>>> feature in the future. Instead of making a judgment call up-front,
>>>>>>> being more conservative and less fancy, and revisiting the design based
>>>>>>> on actual user feedback after the first version is rolled out, seems to
>>>>>>> be the better option. Undoing a feature is much harder than extending it.
>>>>>>>
>>>>>>>
>>>>>>> While I advocate strongly for a simple first version of this feature,
>>>>>>> it's a community decision in the end, and I would not block this KIP if
>>>>>>> there is a broad preference to add `Grouped#withNumberOfPartitions()`
>>>>>>> either.
>>>>>>>
>>>>>>>
>>>>>>> -Matthias
>>>>>>>
>>>>>>> On 11/14/19 11:35 PM, Sophie Blee-Goldman wrote:
>>>>>>>> It seems like we all agree at this point (please correct me if wrong!)
>>>>>>>> that we should NOT change the existing repartitioning behavior, ie we
>>>>>>>> should allow Streams to continue to determine when and where to
>>>>>>>> repartition -- *unless* explicitly informed to by the use of a
>>>>>>>> .through or the new .repartition operator.
>>>>>>>>
>>>>>>>> Regarding groupBy, the existing behavior we should not disrupt is
>>>>>>>> a) repartition *only* when required due to upstream key-changing
>>>>>>>> operation (ie don't force repartitioning based on the presence of an
>>>>>>>> optional config parameter), and
>>>>>>>> b) allow optimization of required repartitions, if any
>>>>>>>>
>>>>>>>> Within the constraint of not breaking the existing behavior, this still
>>>>>>>> leaves open the question of whether we want to improve the user
>>>>>>>> experience by allowing to provide groupBy with a *suggestion* for
>>>>>>>> numPartitions (or to put it more fairly, whether that *will* improve
>>>>>>>> the experience). I agree with many of the arguments outlined above but
>>>>>>>> let me just push back on this one issue one final time, and if we can't
>>>>>>>> come to a consensus then I am happy to drop it for now so that the KIP
>>>>>>>> can proceed.
>>>>>>>>
>>>>>>>> Specifically, my proposal would be to simply augment Grouped with an
>>>>>>>> optional numPartitions, understood to indicate the user's desired number
>>>>>>>> of partitions *if Streams decides to repartition due to that groupBy*
>>>>>>>>
>>>>>>>>> if a user cares about the number of partition, the user wants to
>>>>>>>>> enforce a repartitioning
>>>>>>>> First, I think we should take a step back and examine this claim. I
>>>>>>>> agree 100% that *if this is true, then we should not give groupBy an*
>>>>>>>> *optional numPartitions.* As far as I see it, there's no argument to be
>>>>>>>> had there if we *presuppose that claim.* But I'm not convinced in that
>>>>>>>> as an axiom of the user experience and think we should be examining
>>>>>>>> that claim itself, not the consequences of it.
>>>>>>>>
>>>>>>>> To give a simple example, let's say some new user is trying out Streams
>>>>>>>> and wants to just play around with it to see if it might be worth
>>>>>>>> looking into. They want to just write up a simple app and test it out
>>>>>>>> on the data in some existing topics they have with a large number of
>>>>>>>> partitions, and a lot of data. They're just messing around, trying new
>>>>>>>> topologies and don't want to go through each new one step by step to
>>>>>>>> determine if (or where) a repartition might be required. They also
>>>>>>>> don't want to force a repartition if it turns out to not be required,
>>>>>>>> so they'd like to avoid the nice new .repartition operator they saw.
>>>>>>>> But given the huge number of input partitions, they'd like to rest
>>>>>>>> assured that if a repartition does end up being required somewhere
>>>>>>>> during dev, it will not be created with the same huge number of
>>>>>>>> partitions that their input topic has -- so they just pass groupBy a
>>>>>>>> small numPartitions suggestion.
>>>>>>>>
>>>>>>>> I know that's a bit of a contrived example but I think it does highlight
>>>>>>>> how and when this might be a considerable quality of life improvement,
>>>>>>>> in particular for new users to Streams and/or during the dev cycle.
>>>>>>>> *You don't want to force a repartition if it wasn't necessary, but you*
>>>>>>>> *don't want to create a topic with a huge partition count either.*
>>>>>>>>
>>>>>>>> Also, while the optimization discussion took us down an interesting but
>>>>>>>> ultimately more distracting road, it's worth pointing out that it is
>>>>>>>> clearly a major win to have as few repartition topics/steps as
>>>>>>>> possible. Given that we don't want to change existing behavior, the
>>>>>>>> optimization framework can only help out when the placement of
>>>>>>>> repartition steps is flexible, which means only those from .groupBy
>>>>>>>> (and not .repartition). *Users should not have to choose between*
>>>>>>>> *allowing Streams to optimize the repartition placement, and allowing*
>>>>>>>> *to specify a number of partitions.*
>>>>>>>>
>>>>>>>> Lastly, I have what may be a stupid question but for my own edification
>>>>>>>> of how groupBy works: if you do a .groupBy and a repartition is NOT
>>>>>>>> required, does it ever need to serialize/deserialize any of the data?
>>>>>>>> In other words, if you pass a key/value serde to groupBy and it doesn't
>>>>>>>> trigger a repartition, is the serde(s) just ignored and thus more like
>>>>>>>> a suggestion than a requirement?
>>>>>>>>
>>>>>>>> So again, I don't want to hold up this KIP forever but I feel we've
>>>>>>>> spent some time getting slightly off track (although certainly into
>>>>>>>> very interesting discussions) yet never really addressed or questioned
>>>>>>>> the basic premise: *could a user want to specify a number of partitions*
>>>>>>>> *but not enforce a repartition (at that specific point in the topology)?*
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Nov 15, 2019 at 12:18 AM Matthias J. Sax <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Side remark:
>>>>>>>>>
>>>>>>>>> If the user specifies `repartition()` on both sides of the join, we
>>>>>>>>> can actually throw the exception earlier, ie, when we build the
>>>>>>>>> topology.
>>>>>>>>>
>>>>>>>>> Currently, we can do this check only after Kafka Streams was started,
>>>>>>>>> within `StreamPartitionAssignor#assign()` -- we still need to keep this
>>>>>>>>> check for the case that none or only one side has a user specified
>>>>>>>>> number of partitions though.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> -Matthias
>>>>>>>>>
>>>>>>>>> On 11/14/19 8:15 AM, John Roesler wrote:
>>>>>>>>>> Thanks, all,
>>>>>>>>>>
>>>>>>>>>> I can get behind just totally leaving out repartition-via-groupBy. If
>>>>>>>>>> we only introduce `repartition()` for now, we're making the minimal
>>>>>>>>>> change to gain the desired capability.
>>>>>>>>>>
>>>>>>>>>> Plus, since we agree that `repartition()` should never be optimizable,
>>>>>>>>>> it's a future-compatible proposal. I.e., if we were to add a
>>>>>>>>>> non-optimizable groupBy(partitions) operation now, and want to make it
>>>>>>>>>> optimizable in the future, we have to worry about topology
>>>>>>>>>> compatibility. Better to just do non-optimizable `repartition()` now,
>>>>>>>>>> and add an optimizable `groupBy(partitions)` in the future (maybe).
>>>>>>>>>>
>>>>>>>>>> About joins, yes, it's a concern, and IMO we should just do the same
>>>>>>>>>> thing we do now... check at runtime that the partition counts on both
>>>>>>>>>> sides match and throw an exception otherwise. What this means as a
>>>>>>>>>> user is that if you explicitly repartition the left side to 100
>>>>>>>>>> partitions, and then join with the right side at 10 partitions, you
>>>>>>>>>> get an exception, since this operation is not possible. You'd either
>>>>>>>>>> have to "step down" the left side again, back to 10 partitions, or you
>>>>>>>>>> could repartition the right side to 100 partitions before the join.
>>>>>>>>>> The choice has to be the user's, since it depends on their desired
>>>>>>>>>> execution parallelism.
>>>>>>>>>>
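A minimal sketch of the join check described in the paragraph above, written in plain Java with no Kafka dependency. The names `CopartitionCheck` and `requireSamePartitionCount` are hypothetical and are not the actual `CopartitionedTopicsEnforcer` API; the sketch only assumes that each side of a join exposes a partition count and that a mismatch is reported by throwing.

```java
// Hypothetical illustration only: not the real Kafka Streams implementation.
final class CopartitionCheck {

    private CopartitionCheck() { }

    // Throws if the two sides of a join do not have the same number of partitions.
    static void requireSamePartitionCount(int leftPartitions, int rightPartitions) {
        if (leftPartitions <= 0 || rightPartitions <= 0) {
            throw new IllegalArgumentException("partition counts must be positive");
        }
        if (leftPartitions != rightPartitions) {
            throw new IllegalStateException(
                "Cannot join: left side has " + leftPartitions
                    + " partitions but right side has " + rightPartitions);
        }
    }

    public static void main(String[] args) {
        requireSamePartitionCount(10, 10); // accepted: both sides match
        try {
            requireSamePartitionCount(100, 10); // left side was repartitioned to 100
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this model the user resolves the failure either by stepping the left side back down to 10 partitions or by repartitioning the right side to 100, as the message says.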
>>>>>>>>>> Thanks,
>>>>>>>>>> -John
>>>>>>>>>>
>>>>>>>>>> On Thu, Nov 14, 2019 at 12:55 AM Matthias J. Sax <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Thanks a lot John. I think the way you decompose the operators is
>>>>>>>>>>> super helpful for this discussion.
>>>>>>>>>>>
>>>>>>>>>>> What you suggest with regard to using `Grouped` and enforcing
>>>>>>>>>>> repartitioning if the number of partitions is specified is certainly
>>>>>>>>>>> possible. However, I am not sure if we _should_ do this. My reasoning
>>>>>>>>>>> is that an enforced repartitioning as introduced via `repartition()`
>>>>>>>>>>> is an expensive operation, and it seems better to demand a more
>>>>>>>>>>> explicit user opt-in to trigger it. Just setting an optional
>>>>>>>>>>> parameter might be too subtle to trigger such a heavy "side effect".
>>>>>>>>>>>
>>>>>>>>>>> While I agree about "usability" in general, I would prefer a more
>>>>>>>>>>> conservative approach to introduce this feature, see how it goes, and
>>>>>>>>>>> maybe make it more advanced later on. This also applies to what
>>>>>>>>>>> optimization we may or may not allow (or are able to perform at all).
>>>>>>>>>>>
>>>>>>>>>>> @Levani: Reflecting about my suggestion about `Repartitioned extends
>>>>>>>>>>> Grouped`, I agree that it might not be a good idea.
>>>>>>>>>>>
>>>>>>>>>>> Atm, I see an enforced repartitioning as non-optimizable and as a good
>>>>>>>>>>> first step and I would suggest to not introduce anything else for now.
>>>>>>>>>>> Introducing optimizable enforced repartitioning via `groupBy(...,
>>>>>>>>>>> Grouped)` is something we could add later.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Therefore, I would not change `Grouped` but only introduce
>>>>>>>>>>> `repartition()`. Users that use `groupBy()` atm, and want to opt-in to
>>>>>>>>>>> set the number of partitions, would need to rewrite their code to
>>>>>>>>>>> `selectKey(...).repartition(...).groupByKey()`. It's less convenient
>>>>>>>>>>> but also less risky from an API and optimization point of view.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> @Levani: about joins -> yes, we will need to check the specified number
>>>>>>>>>>> of partitions (if any) and if they don't match, throw an exception. We
>>>>>>>>>>> can discuss this on the PR -- I am just trying to get the PR for
>>>>>>>>>>> KIP-466 merged -- yours is next on the list :)
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Thoughts?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> -Matthias
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 11/12/19 4:51 PM, Levani Kokhreidze wrote:
>>>>>>>>>>>> Thank you all for an interesting discussion. This is very enlightening.
>>>>>>>>>>>>
>>>>>>>>>>>> Thank you Matthias for your explanation. Your arguments are very true. It makes sense that if a user specifies the number of partitions, he/she really cares that those specifications are applied to internal topics.
>>>>>>>>>>>> Unfortunately, in the current implementation this is not true during the `join` operation. As I’ve written in the PR comment, currently, when `Stream#join` is used, `CopartitionedTopicsEnforcer` chooses the max number of partitions from the two source topics.
>>>>>>>>>>>> I’m not really sure what would be the other way around this situation. Maybe fail the stream altogether and inform the user to specify the same number of partitions?
>>>>>>>>>>>> Or should we treat join operations in the same way as right now and basically choose the max number of partitions even when the `repartition` operation is specified, because Kafka Streams “knows the best” how to handle joins?
>>>>>>>>>>>> You can check the integration tests to see how it’s being handled currently. Open to suggestions on that part.
>>>>>>>>>>>>
>>>>>>>>>>>> As for groupBy, I agree and John raised very interesting points. My arguments for allowing users to specify the number of partitions during groupBy operations mainly came from the usability perspective.
>>>>>>>>>>>> So building on top of what John said, maybe it makes sense to make `groupBy` operations smarter so that whenever the user specifies the `numberOfPartitions` configuration, repartitioning will be enforced, wdyt?
>>>>>>>>>>>> I’m not going into the optimization part yet :) I think it will be part of a separate PR and task, but overall it makes sense to apply optimizations where the numbers of partitions are the same.
>>>>>>>>>>>>
>>>>>>>>>>>> As for Repartitioned extending Grouped, I kinda feel that it won’t fit nicely in the current API design.
>>>>>>>>>>>> In addition, in the PR review, John mentioned that there were a lot of troubles in the past trying to use one operation's configuration objects on other operations.
>>>>>>>>>>>> Also it makes sense to keep them separate in terms of compatibility.
>>>>>>>>>>>> In that case, we don’t have to worry, every time Grouped is changed, what the implications would be for `repartition` operations.
>>>>>>>>>>>>
>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>> Levani
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> On Nov 11, 2019, at 9:13 PM, John Roesler <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Ah, thanks for the clarification. I missed your point.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I like the framework you've presented. It does seem simpler to assume
>>>>>>>>>>>>> that they either care about the partition count and want to
>>>>>>>>>>>>> repartition to realize it, or they don't care about the number.
>>>>>>>>>>>>> Returning to this discussion, it does seem unlikely that they care
>>>>>>>>>>>>> about the number and _don't_ care if it actually gets realized.
>>>>>>>>>>>>>
>>>>>>>>>>>>> But then, it still seems like we can just keep the option as part of
>>>>>>>>>>>>> Grouped. As in:
>>>>>>>>>>>>>
>>>>>>>>>>>>> // user does not care
>>>>>>>>>>>>> stream.groupByKey(Grouped /*not specifying partition count*/)
>>>>>>>>>>>>> stream.groupBy(Grouped /*not specifying partition count*/)
>>>>>>>>>>>>>
>>>>>>>>>>>>> // user does care
>>>>>>>>>>>>> stream.repartition(Repartitioned)
>>>>>>>>>>>>> stream.groupByKey(Grouped.numberOfPartitions(...))
>>>>>>>>>>>>> stream.groupBy(Grouped.numberOfPartitions(...))
>>>>>>>>>>>>>
>>>>>>>>>>>>> ----
>>>>>>>>>>>>>
>>>>>>>>>>>>> The above discussion got me thinking about algebra. Matthias is
>>>>>>>>>>>>> absolutely right that `groupByKey(numPartitions)` is equivalent to
>>>>>>>>>>>>> `repartition(numPartitions).groupByKey()`. I'm just not convinced that
>>>>>>>>>>>>> we should force people to apply that expansion themselves vs. having a
>>>>>>>>>>>>> more compact way to express it if they don't care where exactly the
>>>>>>>>>>>>> repartition occurs. However, thinking about these operators
>>>>>>>>>>>>> algebraically can really help *us* narrow down the number of different
>>>>>>>>>>>>> expressions we have to consider.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Let's consider some identities:
>>>>>>>>>>>>>
>>>>>>>>>>>>> A: groupBy(mapper) + agg = mapKey(mapper) + groupByKey + agg
>>>>>>>>>>>>> B: src + ... + groupByKey + agg = src + ... + passthrough + agg
>>>>>>>>>>>>> C: mapKey(mapper) + ... + groupByKey + agg
>>>>>>>>>>>>>    = mapKey(mapper) + ... + repartition + groupByKey + agg
>>>>>>>>>>>>> D: repartition = sink(managed) + src
>>>>>>>>>>>>>
>>>>>>>>>>>>> In these identities, I used one special identifier (...), which means
>>>>>>>>>>>>> any number (0+) of operations that are not src, mapKey, groupBy[Key],
>>>>>>>>>>>>> repartition, or agg.
>>>>>>>>>>>>>
>>>>>>>>>>>>> For mental clarity, I'm just going to make up a rule that groupBy
>>>>>>>>>>>>> operations are not executable. In other words, you have to get to a
>>>>>>>>>>>>> point where you can apply B to convert a groupByKey into a passthrough
>>>>>>>>>>>>> in order to execute the program. This is just a formal way of stating
>>>>>>>>>>>>> what already happens in Kafka Streams.
>>>>>>>>>>>>>
>>>>>>>>>>>>> By applying A, we can just completely leave `groupBy` out of our
>>>>>>>>>>>>> analysis. It trivially decomposes into a mapKey followed by a
>>>>>>>>>>>>> groupByKey.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Then, we can eliminate the "repartition required" case of `groupByKey`
>>>>>>>>>>>>> by applying C followed by D to get to the "no repartition required"
>>>>>>>>>>>>> version of groupByKey, which in turn sets us up to apply B to get an
>>>>>>>>>>>>> executable topology.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Fundamentally, you can think about KIP-221 as proposing a modified
>>>>>>>>>>>>> D identity in which you can specify the partition count of the managed
>>>>>>>>>>>>> sink topic:
>>>>>>>>>>>>> D': repartition(pc) = sink(managed w/ pc) + src
>>>>>>>>>>>>>
>>>>>>>>>>>>> Since users _could_ apply the identities above, we don't actually have
>>>>>>>>>>>>> to add any partition count to groupBy[Key], but we decided early on in
>>>>>>>>>>>>> the KIP discussion that it's more ergonomic to add it. In that case,
>>>>>>>>>>>>> we also have to modify A and C:
>>>>>>>>>>>>> A': groupBy(mapper, pc) + agg
>>>>>>>>>>>>>     = mapKey(mapper) + groupByKey(pc) + agg
>>>>>>>>>>>>> C': mapKey(mapper) + ... + groupByKey(pc) + agg
>>>>>>>>>>>>>     = mapKey(mapper) + ... + repartition(pc) + groupByKey + agg
>>>>>>>>>>>>>
>>>>>>>>>>>>> Which sets us up still to always be able to get back to a plain
>>>>>>>>>>>>> `groupByKey` operation (with no `(pc)`) and then apply D' and
>>>>>>>>>>>>> ultimately B to get an executable topology.
>>>>>>>>>>>>>
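The identities above can be sketched as a small rewriter over a linear list of operator names. This is a hypothetical model in plain Java, not Kafka Streams code: the class `IdentityRewrite`, the string encoding of operators (`groupBy(5)`, `groupByKey`, `repartition(5)`, with the mapper argument omitted), and the choice to treat `groupByKey(pc)` as always implying `repartition(pc)` (the equivalence stated earlier in this message) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the identities in this thread; not Kafka Streams code.
final class IdentityRewrite {

    private IdentityRewrite() { }

    // Expands a linear topology into an executable one using A/A', C/C', D/D' and B.
    static List<String> expand(List<String> ops) {
        List<String> out = new ArrayList<>();
        boolean keyChanged = false; // set by mapKey, cleared by a repartition
        for (String op : ops) {
            int paren = op.indexOf('(');
            String name = paren < 0 ? op : op.substring(0, paren);
            // pc is the text between the parentheses; "" means "unspecified"
            String pc = paren < 0 ? "" : op.substring(paren + 1, op.length() - 1);

            if (name.equals("groupBy")) {            // A / A': groupBy = mapKey + groupByKey
                out.add("mapKey");
                keyChanged = true;
                name = "groupByKey";
            }
            if (name.equals("groupByKey")) {
                if (keyChanged || !pc.isEmpty()) {   // C / C': insert a repartition
                    addRepartition(out, pc);
                    keyChanged = false;
                }
                out.add("passthrough");              // B: plain groupByKey is a passthrough
            } else if (name.equals("repartition")) { // explicit repartition
                addRepartition(out, pc);
                keyChanged = false;
            } else {
                if (name.equals("mapKey")) {
                    keyChanged = true;
                }
                out.add(op);
            }
        }
        return out;
    }

    // D / D': repartition = sink(managed) + src, optionally with a partition count.
    private static void addRepartition(List<String> out, String pc) {
        out.add(pc.isEmpty() ? "sink(managed)" : "sink(managed w/ " + pc + ")");
        out.add("src");
    }

    public static void main(String[] args) {
        System.out.println(expand(Arrays.asList("src", "groupBy(5)", "agg")));
        System.out.println(expand(Arrays.asList("src", "filter", "groupByKey", "agg")));
    }
}
```

Under these assumptions, `src + groupBy(5) + agg` expands to `src, mapKey, sink(managed w/ 5), src, passthrough, agg`, while a `groupByKey` with no upstream key change and no partition count reduces directly to a passthrough.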
>>>>>>>>>>>>> What about the optimizer?
>>>>>>>>>>>>> The optimizer applies another set of graph-algebraic identities to
>>>>>>>>>>>>> minimize the number of repartition topics in a topology.
>>>>>>>>>>>>>
>>>>>>>>>>>>> (forgive my ascii art)
>>>>>>>>>>>>>
>>>>>>>>>>>>> E: (merging repartition nodes)
>>>>>>>>>>>>>    (...) -> repartition -> X
>>>>>>>>>>>>>      \-> repartition -> Y
>>>>>>>>>>>>>    =
>>>>>>>>>>>>>    (... + repartition) -> X
>>>>>>>>>>>>>                       \-> Y
>>>>>>>>>>>>> F: (reordering around repartition)
>>>>>>>>>>>>>    Where SVO is any non-key-changing, stateless, operation:
>>>>>>>>>>>>>    repartition -> SVO = SVO -> repartition
>>>>>>>>>>>>>
>>>>>>>>>>>>> In terms of these identities, what the optimizer does is apply F
>>>>>>>>>>>>> repeatedly in either direction to a topology to factor out what is
>>>>>>>>>>>>> common in branches so that it can apply E to merge repartition nodes.
>>>>>>>>>>>>> This was especially necessary before KIP-221 because you couldn't
>>>>>>>>>>>>> directly express `repartition` in the DSL, only indirectly via
>>>>>>>>>>>>> `groupBy[Key]`, so there was no way to do the factoring manually.
>>>>>>>>>>>>>
>>>>>>>>>>>>> We can now state very clearly that in KIP-221, explicit
>>>>>>>>>>>>> `repartition()` operators should create a "reordering barrier". So, F
>>>>>>>>>>>>> cannot be applied to an explicit `repartition()`. Also, I think we
>>>>>>>>>>>>> decided earlier that explicit `repartition()` operations would also be
>>>>>>>>>>>>> ineligible for merging, so E can't be applied to explicit
>>>>>>>>>>>>> `repartition()` operations either. I think we feel we _could_ apply E
>>>>>>>>>>>>> without harm, but we want to be conservative for now.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I think the salient point from the latter discussion has been that
>>>>>>>>>>>>> when you use `Grouped.numberOfPartitions`, this does _not_ constitute
>>>>>>>>>>>>> an explicit `repartition()` operator, and therefore, the resulting
>>>>>>>>>>>>> repartition node remains eligible for optimization.
>>>>>>>>>>>>>
>>>>>>>>>>>>> To be clear, I agree with Matthias that the provided partition count
>>>>>>>>>>>>> _must_ be used in the resulting implicit repartition. This has some
>>>>>>>>>>>>> implications for E. Namely, E could only be applied to two repartition
>>>>>>>>>>>>> nodes that have the same partition count. This has always been
>>>>>>>>>>>>> trivially true before KIP-221 because the partition count has always
>>>>>>>>>>>>> been "unspecified", i.e., it would be determined at runtime by the
>>>>>>>>>>>>> user-managed-topics' partition counts. Now, it could be specified or
>>>>>>>>>>>>> unspecified. We can simply augment E to allow merging only repartition
>>>>>>>>>>>>> nodes where the partition count is EITHER "specified and the same on
>>>>>>>>>>>>> both sides", OR "unspecified on both sides".
>>>>>>>>>>>>>
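The augmented merge rule for E can be sketched as a small eligibility check. This is a hypothetical model in plain Java, not the Kafka Streams optimizer: `RepartitionMergeRule`, `RepartitionNode`, and `canMerge` are names made up for illustration, and an empty `OptionalInt` is assumed to stand for an "unspecified" partition count. Explicit `repartition()` nodes are treated as never mergeable, following the conservative position stated earlier in this message.

```java
import java.util.OptionalInt;

// Hypothetical illustration only: not the real Kafka Streams optimizer.
final class RepartitionMergeRule {

    private RepartitionMergeRule() { }

    static final class RepartitionNode {
        final boolean explicit;       // true if created by an explicit repartition() call
        final OptionalInt partitions; // empty means the partition count is unspecified

        RepartitionNode(boolean explicit, OptionalInt partitions) {
            this.explicit = explicit;
            this.partitions = partitions;
        }
    }

    // Augmented E: merge only implicit nodes whose partition counts are either
    // both unspecified, or both specified and equal.
    static boolean canMerge(RepartitionNode a, RepartitionNode b) {
        if (a.explicit || b.explicit) {
            return false; // explicit repartition() is not eligible for merging
        }
        if (a.partitions.isPresent() != b.partitions.isPresent()) {
            return false; // one side specified, the other not
        }
        return !a.partitions.isPresent()
            || a.partitions.getAsInt() == b.partitions.getAsInt();
    }

    public static void main(String[] args) {
        RepartitionNode unspecified = new RepartitionNode(false, OptionalInt.empty());
        RepartitionNode five = new RepartitionNode(false, OptionalInt.of(5));
        RepartitionNode ten = new RepartitionNode(false, OptionalInt.of(10));
        System.out.println(canMerge(unspecified, unspecified));
        System.out.println(canMerge(five, five));
        System.out.println(canMerge(five, ten));
        System.out.println(canMerge(five, unspecified));
    }
}
```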
>>>>>>>>>>>>> Sorry for the long email, but I have a hope that it builds a solid
>>>>>>>>>>>>> theoretical foundation for our decisions in KIP-221, so we can have
>>>>>>>>>>>>> confidence that there are no edge cases for design flaws to hide.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> -John
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sat, Nov 9, 2019 at 10:37 PM Matthias J. Sax <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> it seems like we do want to allow
>>>>>>>>>>>>>>> people to optionally specify a partition count as part of this
>>>>>>>>>>>>>>> operation, but we don't want that option to _force_ repartitioning
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Correct, ie, that is my suggestion.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> "Use P partitions if repartitioning is necessary"
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I disagree here, because my reasoning is that:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - if a user cares about the number of partitions, the user wants to
>>>>>>>>>>>>>> enforce a repartitioning
>>>>>>>>>>>>>> - if a user does not care about the number of partitions, we don't
>>>>>>>>>>>>>> need to provide them a way to pass in a "hint"
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hence, it should be sufficient to support:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> // user does not care
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> `stream.groupByKey(Grouped)`
>>>>>>>>>>>>>> `stream.groupBy(..., Grouped)`
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> // user does care
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> `stream.repartition(Repartitioned).groupByKey()`
>>>>>>>>>>>>>> `stream.groupBy(..., Repartitioned)`
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 11/9/19 8:10 PM, John Roesler wrote:
>>>>>>>>>>>>>>> Thanks for those thoughts, Matthias,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I find your reasoning about the optimization behavior
>>>>> compelling.
>>>>>>>>> The
>>>>>>>>>>>>>>> `through` operation is very simple and clear to reason about.
>>>>> It
>>>>>>>>> just
>>>>>>>>>>>>>>> passes the data exactly at the specified point in the topology
>>>>>>>>> exactly
>>>>>>>>>>>>>>> through the specified topic. Likewise, if a user invokes a
>>>>>>>>>>>>>>> `repartition` operator, the simplest behavior is if we just do
>>>>> what
>>>>>>>>>>>>>>> they asked for.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Stepping back to think about when optimizations are surprising
>>>>> and
>>>>>>>>>>>>>>> when they aren't, it occurs to me that we should be free to
>>>>> move
>>>>>>>>>>>>>>> around repartitions when users have asked to perform some
>>>>> operation
>>>>>>>>>>>>>>> that implies a repartition, like "change keys, then filter,
>>>>> then
>>>>>>>>>>>>>>> aggregate". This program requires a repartition, but it could
>>>>> be
>>>>>>>>>>>>>>> anywhere between the key change and the aggregation. On the
>>>>> other
>>>>>>>>>>>>>>> hand, if they say, "change keys, then filter, then
>>>>> repartition, then
>>>>>>>>>>>>>>> aggregate", it seems like they were pretty clear about their
>>>>> desire,
>>>>>>>>>>>>>>> and we should just take it at face value.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> So, I'm sold on just literally doing a repartition every time
>>>>> they
>>>>>>>>>>>>>>> invoke the `repartition` operator.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The "partition count" modifier for `groupBy`/`groupByKey` is
>>>>> more
>>>>>>>>> nuanced.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> What you said about `groupByKey` makes sense. If the key
>>>>> hasn't
>>>>>>>>>>>>>>> actually changed, then we don't need to repartition before
>>>>>>>>>>>>>>> aggregating. On the other hand, `groupBy` is specifically
>>>>> changing
>>>>>>>>> the
>>>>>>>>>>>>>>> key as part of the grouping operation, so (as you said) we
>>>>>>>>> definitely
>>>>>>>>>>>>>>> have to do a repartition.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> If I'm reading the discussion right, it seems like we do want
>>>>> to
>>>>>>>>> allow
>>>>>>>>>>>>>>> people to optionally specify a partition count as part of this
>>>>>>>>>>>>>>> operation, but we don't want that option to _force_
>>>>> repartitioning
>>>>>>>>> if
>>>>>>>>>>>>>>> it's not needed. That last clause is the key. "Use P
>>>>> partitions if
>>>>>>>>>>>>>>> repartitioning is necessary" is a directive that applies
>>>>> cleanly and
>>>>>>>>>>>>>>> correctly to both `groupBy` and `groupByKey`. What if we call
>>>>> the
>>>>>>>>>>>>>>> option `numberOfPartitionsHint`, which along with the "if
>>>>> necessary"
>>>>>>>>>>>>>>> javadoc, should make it clear that the option won't force a
>>>>>>>>>>>>>>> repartition, and also gives us enough latitude to still employ
>>>>> the
>>>>>>>>>>>>>>> optimizer on those repartition topics?
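The "use P partitions if repartitioning is necessary" reading of the hint can be sketched as a small decision function. This is a minimal model of the proposal as described in this message, not Streams code: `PartitionHintSketch` and `repartitionTopicPartitions` are hypothetical names, and the fallback to the upstream partition count is an assumption taken from the description of auto-repartitioning elsewhere in this thread.

```java
import java.util.OptionalInt;

// Hypothetical model of the proposed "hint" semantics; not part of Kafka Streams.
class PartitionHintSketch {

    /**
     * Returns the partition count of the repartition topic, or empty
     * when no repartition topic is created at all.
     *
     * @param repartitionRequired whether the key changed upstream
     * @param hint                the optional numberOfPartitionsHint
     * @param upstreamPartitions  partition count inherited when no hint is given
     */
    static OptionalInt repartitionTopicPartitions(boolean repartitionRequired,
                                                  OptionalInt hint,
                                                  int upstreamPartitions) {
        if (!repartitionRequired) {
            // The hint never forces a repartition on its own.
            return OptionalInt.empty();
        }
        return OptionalInt.of(hint.orElse(upstreamPartitions));
    }

    public static void main(String[] args) {
        // groupByKey with unchanged key: hint is ignored, no repartition.
        System.out.println(repartitionTopicPartitions(false, OptionalInt.of(12), 4));
        // groupBy (key always changes): hint is applied.
        System.out.println(repartitionTopicPartitions(true, OptionalInt.of(12), 4));
        // groupBy without a hint: upstream count is used.
        System.out.println(repartitionTopicPartitions(true, OptionalInt.empty(), 4));
    }
}
```

The first case is the one under debate: the user supplied a count, yet nothing in the topology reflects it.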
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> If we like the idea of expressing it as a "hint" for grouping
>>>>> and a
>>>>>>>>>>>>>>> "command" for `repartition`, then it seems like it still makes
>>>>> sense
>>>>>>>>>>>>>>> to keep Grouped and Repartitioned separate, as they would
>>>>> actually
>>>>>>>>>>>>>>> offer different methods with distinct semantics.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> WDYT?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> -John
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Sat, Nov 9, 2019 at 8:28 PM Matthias J. Sax <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Sorry for late reply.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I guess, the question boils down to the intended semantics of
>>>>>>>>>>>>>>>> `repartition()`. My understanding is as follows:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> - KS does auto-repartitioning for correctness reasons (using
>>>>> the
>>>>>>>>>>>>>>>> upstream topic to determine the number of partitions)
>>>>>>>>>>>>>>>> - KS does auto-repartitioning only for downstream DSL
>>>>> operators
>>>>>>>>> like
>>>>>>>>>>>>>>>> `count()` (e.g., a `transform()` never triggers an
>>>>>>>>> auto-repartitioning
>>>>>>>>>>>>>>>> even if the stream is marked as `repartitioningRequired`).
>>>>>>>>>>>>>>>> - KS offers `through()` to enforce a repartitioning --
>>>>> however,
>>>>>>>>> the user
>>>>>>>>>>>>>>>> needs to create the topic manually (with the desired number of
>>>>>>>>> partitions).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I see two main applications for `repartition()`:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 1) repartition data before a `transform()` but user does not
>>>>> want
>>>>>>>>> to
>>>>>>>>>>>>>>>> manage the topic
>>>>>>>>>>>>>>>> 2) scale out a downstream subtopology
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hence, I see `repartition()` as similar to `through()`: if a user
>>>>>>>>>>>>>>>> calls it, a repartitioning is enforced, with the difference that KS
>>>>>>>>> manages the
>>>>>>>>>>>>>>>> topic and the user does not need to create it.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This behavior makes (1) and (2) possible.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I think many users would prefer to just say "if there *is* a
>>>>>>>>> repartition
>>>>>>>>>>>>>>>>> required at this point in the topology, it should
>>>>>>>>>>>>>>>>> have N partitions"
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Because of (2), I disagree. Either a user does not care about
>>>>>>>>> scaling
>>>>>>>>>>>>>>>> out, in which case she would not specify the number of
>>>>>>>>> partitions. Or a
>>>>>>>>>>>>>>>> user does care, and hence wants to enforce the scale out. I
>>>>> don't
>>>>>>>>> think
>>>>>>>>>>>>>>>> that any user would say, "maybe scale out".
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Therefore, the optimizer should never ignore the repartition
>>>>>>>>> operation.
>>>>>>>>>>>>>>>> As a "consequence" (because repartitioning is expensive) a
>>>>> user
>>>>>>>>> should
>>>>>>>>>>>>>>>> make an explicit call to `repartition()` IMHO -- piggybacking
>>>>> an
>>>>>>>>>>>>>>>> enforced repartitioning into `groupByKey()` seems to be
>>>>> "dangerous"
>>>>>>>>>>>>>>>> because it might be too subtle and an "optional scaling out"
>>>>> as
>>>>>>>>> laid out
>>>>>>>>>>>>>>>> above does not make sense IMHO.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I am also not worried about "over repartitioning" because the
>>>>>>>>> result
>>>>>>>>>>>>>>>> stream would never trigger auto-repartitioning. Only if
>>>>> multiple
>>>>>>>>>>>>>>>> consecutive calls to `repartition()` are made it could be bad
>>>>> --
>>>>>>>>> but
>>>>>>>>>>>>>>>> that's the same with `through()`. In the end, there is always
>>>>> some
>>>>>>>>>>>>>>>> responsibility on the user.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Btw, for `.groupBy()` we know that repartitioning will be
>>>>> required,
>>>>>>>>>>>>>>>> however, for `groupByKey()` it depends if the KStream is
>>>>> marked as
>>>>>>>>>>>>>>>> `repartitioningRequired`.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hence, for `groupByKey()` it should not be possible for a
>>>>> user to
>>>>>>>>> set
>>>>>>>>>>>>>>>> number of partitions IMHO. For `groupBy()` it's a different
>>>>> story,
>>>>>>>>>>>>>>>> because calling
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> `repartition().groupBy()`
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> does not achieve what we want. Hence, allowing users to pass the
>>>>>>>>>>>>>>>> number of partitions into `groupBy()` does actually make sense,
>>>>>>>>>>>>>>>> because repartitioning will happen anyway and thus we can
>>>>>>>>> piggyback a
>>>>>>>>>>>>>>>> scaling decision.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I think that John has a fair concern about the overloads,
>>>>> however,
>>>>>>>>> I am
>>>>>>>>>>>>>>>> not convinced that using `Grouped` to specify the number of
>>>>>>>>> partitions
>>>>>>>>>>>>>>>> is intuitive. I double checked `Grouped` and `Repartitioned`
>>>>> and
>>>>>>>>> both
>>>>>>>>>>>>>>>> allow to specify a `name` and `keySerde/valueSerde`. Thus, I
>>>>> am
>>>>>>>>>>>>>>>> wondering if we could bridge the gap between both, if we
>>>>> would make
>>>>>>>>>>>>>>>> `Repartitioned extends Grouped`? For this case, we only need
>>>>>>>>>>>>>>>> `groupBy(Grouped)` and a user can pass in both types, which seems to
>>>>>>>>>>>>>>>> make the API quite smooth:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> `stream.groupBy(..., Grouped...)`
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> `stream.groupBy(..., Repartitioned...)`
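The `Repartitioned extends Grouped` idea floated above can be illustrated with plain stand-in classes. Everything here is hypothetical and deliberately simplified: the nested `Grouped` and `Repartitioned` types are toy models (serdes omitted), `groupBy` only returns a description of what would happen, and none of this reflects the actual Kafka Streams class hierarchy.

```java
// Hypothetical model of a single groupBy(Grouped) overload accepting both
// config types; these are toy classes, not the Kafka Streams ones.
class GroupedHierarchySketch {

    /** Toy config carrying only a name. */
    static class Grouped {
        final String name;

        Grouped(String name) {
            this.name = name;
        }
    }

    /** Toy subclass that additionally carries a partition count. */
    static class Repartitioned extends Grouped {
        final int numberOfPartitions;

        Repartitioned(String name, int numberOfPartitions) {
            super(name);
            this.numberOfPartitions = numberOfPartitions;
        }
    }

    /** One overload serves both callers; the subtype decides the scaling. */
    static String groupBy(Grouped config) {
        if (config instanceof Repartitioned) {
            int partitions = ((Repartitioned) config).numberOfPartitions;
            return config.name + ": repartition with " + partitions + " partitions";
        }
        return config.name + ": repartition with upstream partition count";
    }

    public static void main(String[] args) {
        System.out.println(groupBy(new Grouped("by-user")));
        System.out.println(groupBy(new Repartitioned("by-user", 8)));
    }
}
```

The sketch shows the appeal (no extra overload) and the cost: the behaviour of `groupBy` now depends on the runtime type of its argument.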
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thoughts?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 11/7/19 10:59 AM, Levani Kokhreidze wrote:
>>>>>>>>>>>>>>>>> Hi Sophie,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thank you for your reply, very insightful. Looking forward to
>>>>>>>>>>>>>>>>> hearing others' opinions on this as well.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Nov 6, 2019, at 1:30 AM, Sophie Blee-Goldman <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Personally, I think Matthias’s concern is valid, but on the other
>>>>>>>>>>>>>>>>>>> hand Kafka Streams already has an optimizer in place which alters
>>>>>>>>>>>>>>>>>>> the topology independently of the user
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I agree (with you) and think this is a good way to put it
>>>>> -- we
>>>>>>>>> currently
>>>>>>>>>>>>>>>>>> auto-repartition for the user so
>>>>>>>>>>>>>>>>>> that they don't have to walk through their entire topology
>>>>> and
>>>>>>>>> reason about
>>>>>>>>>>>>>>>>>> when and where to place a
>>>>>>>>>>>>>>>>>> `.through` (or the new `.repartition`), so why suddenly
>>>>> force
>>>>>>>>> this onto the
>>>>>>>>>>>>>>>>>> user? How certain are we that
>>>>>>>>>>>>>>>>>> users will always get this right? It's easy to imagine that
>>>>>>>>> during
>>>>>>>>>>>>>>>>>> development, you write your new app with
>>>>>>>>>>>>>>>>>> correctly placed repartitions in order to use this new
>>>>> feature.
>>>>>>>>> During the
>>>>>>>>>>>>>>>>>> course of development you end up
>>>>>>>>>>>>>>>>>> tweaking the topology, but don't remember to review or move
>>>>> the
>>>>>>>>>>>>>>>>>> repartitioning since you're used to Streams
>>>>>>>>>>>>>>>>>> doing this for you. If you use only single-partition topics
>>>>> for
>>>>>>>>> testing,
>>>>>>>>>>>>>>>>>> you might not even notice your app is
>>>>>>>>>>>>>>>>>> spitting out incorrect results!
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Anyways, I feel pretty strongly that it would be weird to
>>>>>>>>> introduce a new
>>>>>>>>>>>>>>>>>> feature and say that to use it, you can't take
>>>>>>>>>>>>>>>>>> advantage of this other feature anymore. Also, is it
>>>>> possible our
>>>>>>>>>>>>>>>>>> optimization framework could ever include an
>>>>>>>>>>>>>>>>>> optimized repartitioning strategy that is better than what a
>>>>>>>>> user could
>>>>>>>>>>>>>>>>>> achieve by manually inserting repartitions?
>>>>>>>>>>>>>>>>>> Do we expect users to have a deep understanding of the best
>>>>> way
>>>>>>>>> to
>>>>>>>>>>>>>>>>>> repartition their particular topology, or is it
>>>>>>>>>>>>>>>>>> likely they will end up over-repartitioning either due to
>>>>> missed
>>>>>>>>>>>>>>>>>> optimizations or unnecessary extra repartitions?
>>>>>>>>>>>>>>>>>> I think many users would prefer to just say "if there *is* a
>>>>>>>>> repartition
>>>>>>>>>>>>>>>>>> required at this point in the topology, it should
>>>>>>>>>>>>>>>>>> have N partitions"
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> As to the idea of adding `numberOfPartitions` to Grouped
>>>>> rather
>>>>>>>>> than
>>>>>>>>>>>>>>>>>> adding a new parameter to groupBy, that does seem more in
>>>>> line
>>>>>>>>> with the
>>>>>>>>>>>>>>>>>> current syntax so +1 from me
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Tue, Nov 5, 2019 at 2:07 PM Levani Kokhreidze <
>>>>>>>>> [email protected]>
>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hello all,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> While https://github.com/apache/kafka/pull/7170 is under review
>>>>>>>>>>>>>>>>>>> and almost done, I want to resurrect the discussion about this
>>>>>>>>>>>>>>>>>>> KIP to address a couple of concerns raised by Matthias and John.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> As a reminder, the idea of KIP-221 was to give DSL users control
>>>>>>>>>>>>>>>>>>> over repartitioning and parallelism of sub-topologies by:
>>>>>>>>>>>>>>>>>>> 1) Introducing a new KStream#repartition operation, which is done
>>>>>>>>>>>>>>>>>>> in https://github.com/apache/kafka/pull/7170
>>>>>>>>>>>>>>>>>>> 2) Adding a new KStream#groupBy(Repartitioned) operation, which is
>>>>>>>>>>>>>>>>>>> planned as a separate PR.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> While all agree about the general implementation and idea behind
>>>>>>>>>>>>>>>>>>> the https://github.com/apache/kafka/pull/7170 PR, introducing the
>>>>>>>>>>>>>>>>>>> new KStream#groupBy(Repartitioned) method overload raised some
>>>>>>>>>>>>>>>>>>> questions during the review.
>>>>>>>>>>>>>>>>>>> Matthias raised the concern that there can be cases where a user
>>>>>>>>>>>>>>>>>>> calls `KStream#groupBy(Repartitioned)` but no actual
>>>>>>>>>>>>>>>>>>> repartitioning is required, so the configuration passed via
>>>>>>>>>>>>>>>>>>> `Repartitioned` would never be applied (Matthias, please correct
>>>>>>>>>>>>>>>>>>> me if I misinterpreted your comment). So instead, a user who
>>>>>>>>>>>>>>>>>>> wants to control the parallelism of sub-topologies should always
>>>>>>>>>>>>>>>>>>> use the `KStream#repartition` operation before groupBy. The full
>>>>>>>>>>>>>>>>>>> comment can be seen here:
>>>>>>>>>>>>>>>>>>> https://github.com/apache/kafka/pull/7170#issuecomment-519303125
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On the same topic, John pointed out that, from an API design
>>>>>>>>>>>>>>>>>>> perspective, we shouldn’t intertwine the configuration classes of
>>>>>>>>>>>>>>>>>>> different operators. So instead of introducing a new
>>>>>>>>>>>>>>>>>>> `KStream#groupBy(Repartitioned)` overload for specifying the
>>>>>>>>>>>>>>>>>>> number of partitions of the internal topic, we should extend the
>>>>>>>>>>>>>>>>>>> existing `Grouped` class with a `numberOfPartitions` field.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Personally, I think Matthias’s concern is valid, but on the other
>>>>>>>>>>>>>>>>>>> hand Kafka Streams already has an optimizer in place which alters
>>>>>>>>>>>>>>>>>>> the topology independently of the user. So maybe it makes sense
>>>>>>>>>>>>>>>>>>> for Kafka Streams to internally optimize the topology in the best
>>>>>>>>>>>>>>>>>>> way possible, even if in some cases this means ignoring some
>>>>>>>>>>>>>>>>>>> operator configurations passed by the user. Also, I agree with
>>>>>>>>>>>>>>>>>>> John about API design semantics. If we go through with the
>>>>>>>>>>>>>>>>>>> changes for the `KStream#groupBy` operation, it makes more sense
>>>>>>>>>>>>>>>>>>> to add a `numberOfPartitions` field to the `Grouped` class
>>>>>>>>>>>>>>>>>>> instead of introducing a new `KStream#groupBy(Repartitioned)`
>>>>>>>>>>>>>>>>>>> method overload.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I would really appreciate the community's feedback on this.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Oct 17, 2019, at 12:57 AM, Sophie Blee-Goldman <
>>>>>>>>> [email protected]>
>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hey Levani,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I think people are busy with the upcoming 2.4 release, and
>>>>>>>>> don't have
>>>>>>>>>>>>>>>>>>> much
>>>>>>>>>>>>>>>>>>>> spare time at the
>>>>>>>>>>>>>>>>>>>> moment. It's kind of a difficult time to get attention on
>>>>>>>>> things, but
>>>>>>>>>>>>>>>>>>> feel
>>>>>>>>>>>>>>>>>>>> free to pick up something else
>>>>>>>>>>>>>>>>>>>> to work on in the meantime until things have calmed down
>>>>> a bit!
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>>>> Sophie
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Wed, Oct 16, 2019 at 11:26 AM Levani Kokhreidze <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hello all,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Sorry for bringing up this thread again, but I would like to get
>>>>>>>>>>>>>>>>>>>>> some attention on this PR: https://github.com/apache/kafka/pull/7170
>>>>>>>>>>>>>>>>>>>>> It's been a while now and I would love to move on to
>>>>> other
>>>>>>>>> KIPs as well.
>>>>>>>>>>>>>>>>>>>>> Please let me know if you have any concerns.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Jul 26, 2019, at 11:25 AM, Levani Kokhreidze <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Here’s voting thread for this KIP:
>>>>>>>>>>>>>>>>>>>>>> https://www.mail-archive.com/[email protected]/msg99680.html
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On Jul 24, 2019, at 11:15 PM, Levani Kokhreidze <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Hi Matthias,
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Thanks for the suggestion. I don’t have a strong opinion
>>>>> on
>>>>>>>>> that one.
>>>>>>>>>>>>>>>>>>>>>>> Agree that avoiding unnecessary method overloads is a
>>>>> good
>>>>>>>>> idea.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Updated KIP
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> On Jul 24, 2019, at 8:50 PM, Matthias J. Sax <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> One question:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Why do we add
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Repartitioned#with(final String name, final int numberOfPartitions)
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> It seems that `#with(String name)`, `#numberOfPartitions(int)` in
>>>>>>>>>>>>>>>>>>>>>>>> combination with `withName()` and `withNumberOfPartitions()` should
>>>>>>>>>>>>>>>>>>>>>>>> be sufficient. Users can chain the method calls.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> (I think it's valuable to keep the number of overloads small if
>>>>>>>>>>>>>>>>>>>>>>>> possible.)
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Otherwise LGTM.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> On 7/23/19 2:18 PM, Levani Kokhreidze wrote:
>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Thanks all for your feedback.
>>>>>>>>>>>>>>>>>>>>>>>>> I started the voting procedure for this KIP. If there are any other
>>>>>>>>>>>>>>>>>>>>>>>>> concerns about this KIP, please let me know.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 20, 2019, at 8:39 PM, Levani Kokhreidze <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Matthias,
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the suggestion, makes sense.
>>>>>>>>>>>>>>>>>>>>>>>>>> I’ve updated KIP (
>>>>>>>>>>>>>>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>>>>>>>>> ).
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 20, 2019, at 3:53 AM, Matthias J. Sax <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for driving the KIP.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> I agree that users need to be able to specify a
>>>>>>>>> partitioning
>>>>>>>>>>>>>>>>>>>>> strategy.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Sophie raises a fair point about topic configs and producer
>>>>>>>>>>>>>>>>>>>>>>>>>>> configs. My take is that considering `Repartitioned` as an
>>>>>>>>>>>>>>>>>>>>>>>>>>> "extension" to `Produced` that adds topic configuration is a good
>>>>>>>>>>>>>>>>>>>>>>>>>>> way to think about it and helps to keep the API "clean".
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> With regard to method names, I would prefer to avoid abbreviations.
>>>>>>>>>>>>>>>>>>>>>>>>>>> Can we rename:
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> `withNumOfPartitions` -> `withNumberOfPartitions`
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Furthermore, it might be good to add some more `static` methods:
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> - Repartitioned.with(Serde<K>, Serde<V>)
>>>>>>>>>>>>>>>>>>>>>>>>>>> - Repartitioned.withNumberOfPartitions(int)
>>>>>>>>>>>>>>>>>>>>>>>>>>> - Repartitioned.streamPartitioner(StreamPartitioner)
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> On 7/19/19 3:33 PM, Levani Kokhreidze wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Totally agree. I think in the KStream interface it makes sense to
>>>>>>>>>>>>>>>>>>>>>>>>>>>> have some duplicate configurations between operators in order to
>>>>>>>>>>>>>>>>>>>>>>>>>>>> keep the API simple and usable.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Also, the more surface area an API has, the harder it is to maintain
>>>>>>>>>>>>>>>>>>>>>>>>>>>> proper backward compatibility.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> While the initial idea of keeping topic level configs separate was
>>>>>>>>>>>>>>>>>>>>>>>>>>>> exciting, having the Repartitioned class encapsulate some producer
>>>>>>>>>>>>>>>>>>>>>>>>>>>> level configs makes the API more readable.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 20, 2019, at 1:15 AM, Sophie Blee-Goldman <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I think that is a good point about trying to keep
>>>>>>>>> producer level
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> configurations and (repartition) topic level
>>>>>>>>> considerations
>>>>>>>>>>>>>>>>>>>>> separate.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Number of partitions is definitely purely a topic
>>>>>>>>> level
>>>>>>>>>>>>>>>>>>>>> configuration. But
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> on some level, serdes and partitioners are just
>>>>> as
>>>>>>>>> much a topic
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> configuration as a producer one. You could have
>>>>> two
>>>>>>>>> producers
>>>>>>>>>>>>>>>>>>>>> configured
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> with different serdes and/or partitioners, but if
>>>>>>>>> they are
>>>>>>>>>>>>>>>>>>>>> writing to the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> same topic the result would be very difficult to
>>>>>>>>> parse. So in a
>>>>>>>>>>>>>>>>>>>>> sense, these
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> are configurations of topics in Streams, not just
>>>>>>>>> producers.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Another way to think of it: while the Streams
>>>>> API is
>>>>>>>>> not always
>>>>>>>>>>>>>>>>>>>>> true to
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> this, ideally all the relevant configs for an
>>>>>>>>> operator are
>>>>>>>>>>>>>>>>>>>>> wrapped into a
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> single object (in this case, Repartitioned). We
>>>>> could
>>>>>>>>> instead
>>>>>>>>>>>>>>>>>>>>> split out the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> fields in common with Produced into a separate
>>>>>>>>> parameter to keep
>>>>>>>>>>>>>>>>>>>>> topic and
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> producer level configurations separate, but this
>>>>>>>>> increases the
>>>>>>>>>>>>>>>>>>>>> API surface
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> area by a lot. It's much more straightforward to
>>>>> just
>>>>>>>>> say "this
>>>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> everything that this particular operator needs"
>>>>>>>>> without worrying
>>>>>>>>>>>>>>>>>>>>> about what
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> exactly you're specifying.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I suppose you could alternatively make Produced a
>>>>>>>>> field of
>>>>>>>>>>>>>>>>>>>>> Repartitioned,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> but I don't think we do this kind of composition
>>>>>>>>> elsewhere in
>>>>>>>>>>>>>>>>>>>>> Streams at
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the moment
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Fri, Jul 19, 2019 at 1:45 PM Levani Kokhreidze <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Bill,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks a lot for the feedback.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Yes, that makes sense. I’ve updated the KIP with the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> `Repartitioned#partitioner` configuration.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> In the beginning, I wanted to introduce a class for topic level
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> configuration and keep topic level and producer level
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> configurations (such as Produced) separate (see my second email in
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> this thread).
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> But while looking at the semantics of the KStream interface, I
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> couldn’t really figure out a good operation name for a topic level
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> configuration class, and just introducing a `Topic` config class
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> was kind of breaking the semantics.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> So I think having a Repartitioned class which encapsulates topic
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and producer level configurations for internal topics is a viable
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> thing to do.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 19, 2019, at 7:47 PM, Bill Bejeck <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Levani,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for resurrecting this KIP.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I'm also a +1 for adding a partition option. In addition to the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> reason provided by John, my reasoning is:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1. Users may want to use something other than hash-based partitioning
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2. Users may wish to partition on something different than the key
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>    without having to change the key. For example:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>    1. A combination of fields in the value in conjunction with the key
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>    2. Something other than the key
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3. We allow users to specify a partitioner on Produced, hence in
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>    KStream.to and KStream.through, so it makes sense for API
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>    consistency.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Just my  2 cents.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Bill
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
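
A minimal sketch of Bill's second point (partitioning on something other than
the key without changing the key), assuming a simplified partitioner callback
that sees both key and value. The `KeyValuePartitioner` interface, the `Order`
class and the `BY_REGION` rule are hypothetical names made up for illustration;
they are not the Kafka Streams API, and the hash used here is plain
`String#hashCode`, not whatever a real producer would use.

```java
final class ValuePartitionerSketch {
    /** Hypothetical value type: the record key is an order id, the value carries a region. */
    static final class Order {
        final String region;
        final double amount;

        Order(String region, double amount) {
            this.region = region;
            this.amount = amount;
        }
    }

    /** Hypothetical stand-in for a partitioner callback that sees key and value. */
    interface KeyValuePartitioner<K, V> {
        int partition(K key, V value, int numPartitions);
    }

    /** Routes by a field of the value; the record key itself is never rewritten. */
    static final KeyValuePartitioner<String, Order> BY_REGION =
        (key, order, numPartitions) -> Math.floorMod(order.region.hashCode(), numPartitions);

    public static void main(String[] args) {
        int partitions = 6;
        Order first = new Order("eu", 10.0);
        Order second = new Order("eu", 99.0);
        // Different keys, same region: both records are routed to the same partition.
        System.out.println(BY_REGION.partition("order-1", first, partitions));
        System.out.println(BY_REGION.partition("order-2", second, partitions));
    }
}
```

The point of the sketch is only that the routing decision and the record key
are independent: downstream operators still see "order-1" and "order-2" as keys.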
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Fri, Jul 19, 2019 at 5:46 AM Levani Kokhreidze <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi John,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> In my mind it makes sense.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> If we add a partitioner configuration to the Repartitioned class,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> then, combined with specifying the number of partitions for
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> internal topics, users will have the opportunity to ensure
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> co-partitioning before a join operation.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I think this can be quite a powerful feature.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Wondering what others think about this?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 18, 2019, at 1:20 AM, John Roesler <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Yes, I believe that's what I had in mind. Again, not totally sure
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> it makes sense, but I believe something similar is the rationale
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> for having the partitioner option in Produced.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> -John
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 3:20 PM Levani Kokhreidze <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hey John,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Oh, that’s an interesting use case.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Do I understand this correctly: in your example, I would first
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> issue repartition(Repartitioned) with a proper partitioner,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> essentially the same one used by the topic I want to join with,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and then do the KStream#join with the DSL?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 17, 2019, at 11:11 PM, John Roesler <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hey, all, just to chime in,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I think it might be useful to have an option to specify the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> partitioner. The case I have in mind is that some data may get
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> repartitioned and then joined with an input topic. If the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> right-side input topic uses a custom partitioning strategy, then
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the repartitioned stream also needs to be partitioned with the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> same strategy.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Does that make sense, or did I maybe miss something important?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> -John
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
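
To make John's co-partitioning case concrete, here is a small self-contained
sketch. It assumes a simplified model in which a partitioner is just a function
from (key, partition count) to a partition number; the `Partitioner` interface,
`HASH`, `BY_LENGTH` and `coPartitioned` are hypothetical names for illustration
only, not Kafka APIs, and `HASH` is a plain `hashCode` stand-in rather than the
producer's real default.

```java
import java.util.List;

final class CoPartitioningSketch {
    /** Hypothetical stand-in: maps a key to a partition in [0, numPartitions). */
    interface Partitioner<K> {
        int partition(K key, int numPartitions);
    }

    /** Simplified hash-based strategy (not the real Kafka default partitioner). */
    static final Partitioner<String> HASH =
        (key, n) -> Math.floorMod(key.hashCode(), n);

    /** A made-up "custom" strategy, standing in for the right-side topic's partitioner. */
    static final Partitioner<String> BY_LENGTH =
        (key, n) -> Math.floorMod(key.length(), n);

    /**
     * Two sides are co-partitioned for the given keys when the partition counts
     * match and every key lands in the same partition number on both sides.
     */
    static <K> boolean coPartitioned(List<K> keys,
                                     Partitioner<K> left, int leftPartitions,
                                     Partitioner<K> right, int rightPartitions) {
        if (leftPartitions != rightPartitions) {
            return false;
        }
        for (K key : keys) {
            if (left.partition(key, leftPartitions) != right.partition(key, rightPartitions)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> keys = List.of("a", "b", "user-42");
        // Repartitioned side uses the same custom strategy and count as the input topic.
        System.out.println(coPartitioned(keys, BY_LENGTH, 4, BY_LENGTH, 4));
        // Repartitioned side falls back to hashing while the input topic is custom.
        System.out.println(coPartitioned(keys, HASH, 4, BY_LENGTH, 4));
        // Same strategy but a different number of partitions.
        System.out.println(coPartitioned(keys, BY_LENGTH, 4, BY_LENGTH, 8));
    }
}
```

Only the first call satisfies both conditions; a join over the other two
layouts could pair a task with a partition that does not hold the matching keys.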
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 2:48 PM Levani Kokhreidze <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Yes, I was thinking about it as well. To be honest, I’m not sure
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> about it yet.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> As a Kafka Streams DSL user, I don’t really think I would need
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> control over the partitioner for internal topics.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> As a user, I would assume that Kafka Streams knows best how to
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> partition data for internal topics.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> In this KIP I wrote that Produced should be used only for topics
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> that are created by the user in advance.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> In those cases, maybe it makes sense to have the possibility to
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> specify the partitioner.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I don’t have a clear answer on that yet, but I guess specifying
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the partitioner can be added as well if there’s agreement on this.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 17, 2019, at 10:42 PM, Sophie Blee-Goldman <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for clearing that up. I agree that Repartitioned would be
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> a useful addition. I'm wondering if it might also need to have
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> a withStreamPartitioner method/field, similar to Produced? I'm
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> not sure how widely this feature is really used, but it seems it
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> should be available for repartition topics.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 11:26 AM Levani Kokhreidze <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hey Sophie,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> In both cases, KStream#repartition and
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> KStream#repartition(Repartitioned), the topic will be created
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and managed by Kafka Streams.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> The idea of Repartitioned is to give the user more control over
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the topic, such as the number of partitions.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I feel like the Repartitioned parameter is something that is
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> missing in the current DSL design: essentially, giving the user
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> control over parallelism by configuring the number of partitions
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> for internal topics.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hope this answers your question.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 17, 2019, at 9:02 PM, Sophie Blee-Goldman <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hey Levani,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the KIP! Can you clarify one thing for me -- for the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> KStream#repartition signature taking a Repartitioned, will the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> topic be auto-created by Streams (which seems to be the case for
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the signature without a Repartitioned) or does it have to be
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> pre-created? The wording in the KIP makes it seem like one
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> version of the method will auto-create topics while the other
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> will not.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Sophie
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 10:15 AM Levani Kokhreidze <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> One more bump about KIP-221
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> (https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> so it doesn’t get lost in the mailing list :)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Would love to hear the community’s opinions/concerns about this KIP.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 12, 2019, at 5:27 PM, Levani Kokhreidze <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind reminder about this KIP:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 9, 2019, at 11:38 AM, Levani Kokhreidze <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> In order to move this KIP forward, I’ve updated the Confluence
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> page with the new proposal:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I’ve also filled in the “Rejected Alternatives” section.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Looking forward to discussing this KIP :)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 3, 2019, at 1:08 PM, Levani Kokhreidze <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello Matthias,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the feedback and ideas.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I like the idea of introducing a dedicated `Topic` class for
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> topic configuration for internal operators like `groupBy`.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> It would be great to hear others’ opinions about this as well.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 3, 2019, at 7:00 AM, Matthias J. Sax <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for picking up this KIP! And thanks for summarizing
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> everything. Even if some points may have been discussed already
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> (can't really remember), it's helpful to get a good summary to
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> refresh the discussion.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I think your reasoning makes sense. With regard to the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> distinction between operators that manage topics and operators
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> that use user-created topics: Following this argument, it might
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> indicate that leaving `through()` as-is (as an operator that
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> uses user-defined topics) and introducing a new `repartition()`
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> operator (an operator that manages topics itself) might be good.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Otherwise, there is one operator `through()` that sometimes
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> manages topics and sometimes does not; a different name, i.e., a
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> new operator, would make the distinction clearer.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> About adding `numOfPartitions` to `Grouped`. I am wondering if
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the same argument as for `Produced` does apply and adding it is
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> semantically questionable? Might be good to get opinions of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> others on this, too. I am not sure myself what solution I prefer
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> atm.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> So far, KS uses configuration objects that allow configuring a
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> certain "entity" like a consumer, producer, or store. If we
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> assume that a topic is a similar entity, I am wondering if we
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> should have a `Topic#withNumberOfPartitions()` class and method
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> instead of a plain integer? This would allow us to add other
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> configs, like replication factor, retention time, etc., easily,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> without the need to change the "main API".
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Just want to give some ideas. Not sure if I like them myself. :)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
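
A rough sketch of the configuration-object idea floated above: an immutable,
fluent topic config whose `with...` methods each return a new instance, so more
settings can be added later without touching operator signatures. Everything
here is hypothetical: `TopicConfigSketch`, its fields and its default values
are assumptions for illustration, and no such `Topic` class is claimed to exist
in Kafka Streams.

```java
final class TopicConfigSketch {
    final int numberOfPartitions;
    final short replicationFactor;
    final long retentionMs;

    private TopicConfigSketch(int numberOfPartitions, short replicationFactor, long retentionMs) {
        this.numberOfPartitions = numberOfPartitions;
        this.replicationFactor = replicationFactor;
        this.retentionMs = retentionMs;
    }

    /** Starting point; the default values are arbitrary placeholders. */
    static TopicConfigSketch defaults() {
        return new TopicConfigSketch(1, (short) 1, 7L * 24 * 60 * 60 * 1000);
    }

    /** Returns a copy with a different partition count; the receiver is unchanged. */
    TopicConfigSketch withNumberOfPartitions(int partitions) {
        if (partitions < 1) {
            throw new IllegalArgumentException("partitions must be >= 1");
        }
        return new TopicConfigSketch(partitions, replicationFactor, retentionMs);
    }

    /** Returns a copy with a different replication factor. */
    TopicConfigSketch withReplicationFactor(short factor) {
        if (factor < 1) {
            throw new IllegalArgumentException("replication factor must be >= 1");
        }
        return new TopicConfigSketch(numberOfPartitions, factor, retentionMs);
    }

    /** Returns a copy with a different retention time in milliseconds. */
    TopicConfigSketch withRetentionMs(long millis) {
        return new TopicConfigSketch(numberOfPartitions, replicationFactor, millis);
    }

    public static void main(String[] args) {
        TopicConfigSketch config = TopicConfigSketch.defaults()
            .withNumberOfPartitions(12)
            .withReplicationFactor((short) 3);
        System.out.println(config.numberOfPartitions + " partitions, replication factor "
            + config.replicationFactor);
    }
}
```

The trade-off against a plain `int numOfPartitions` parameter is one extra
type in exchange for room to grow: a new setting becomes a new `with...` method
rather than a new overload on every operator.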
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On 7/1/19 1:04 AM, Levani Kokhreidze wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Actually, giving it more thought - maybe enhancing Produced with
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> a number-of-partitions configuration is not the best approach.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Let me explain why:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) If we enhance the Produced class with this configuration, this
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> will also affect the KStream#to operation. Since KStream#to is
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the final sink of the topology, it seems to me a reasonable
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> assumption that the user needs to manually create the sink topic
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> in advance. And in that case, having a number-of-partitions
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> configuration doesn’t make much sense.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Looking at the Produced class, based on the API contract, it
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> seems like Produced is designed to be something that is
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> explicitly for the producer (key serializer, value serializer,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and partitioner are all producer-specific configurations), while
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the number of partitions is a topic-level configuration. And I
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> don’t think mixing topic- and producer-level configurations
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> together in one class is a good approach.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) Looking at the KStream interface, it seems like the Produced
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> parameter is for operations that work with topics that are not
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> internal (internal meaning created and managed by Kafka Streams),
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and I think we should leave it as it is in that case.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Taking all these things into account, I think we should
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> distinguish between DSL operations where Kafka Streams should
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> create and manage internal topics (KStream#groupBy) and
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> operations on topics that should be created in advance (e.g.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> KStream#to).
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> To sum it up, I think adding a numPartitions configuration to
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Produced will result in mixing topic- and producer-level
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> configuration in one class, and it’s going to break existing API
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> semantics.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regarding making the topic name optional in KStream#through - I
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> think the underlying idea is very useful, and giving users the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> possibility to specify the number of partitions there is even
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> more useful :) Considering the arguments against adding the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> number of partitions to the Produced class, I see two options
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> here:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Add the following method overloads
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through() - the topic will be auto-generated and the number of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   partitions will be taken from the source topic
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions) - the topic will be
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   auto-generated with the specified number of partitions
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions, final Produced<K, V>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   produced) - the topic will be generated with the specified
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   number of partitions and the configuration taken from the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   produced parameter.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Leave KStream#through as it is and introduce a new method -
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> KStream#repartition (I think Matthias suggested this in one of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the threads)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Considering everything mentioned above, I propose the following
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> plan:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Option A:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Leave Produced as it is
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Add a number-of-partitions configuration to the Grouped class
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> (as mentioned in the KIP)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) Add the following method overloads to KStream#through
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through() - the topic will be auto-generated and the number of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> partitions will be taken from the source topic
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions) - the topic will be
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> auto-generated with the specified number of partitions
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions, final Produced<K, V>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> produced) - the topic will be auto-generated with the specified
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> number of partitions and the configuration taken from the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> produced parameter.
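The partition-count rule behind the through() overloads in Option A can be sketched as below. This is only a toy model, not Kafka Streams code: the class ThroughOverloadSketch, the TopicPlan holder, and the generated topic name are all hypothetical, and the source topic's partition count is passed in explicitly because the sketch has no topology to read it from. The overload taking Produced is left out, since it would only add serialization settings on top of the same rule.

```java
import java.util.OptionalInt;

// Hypothetical model of the proposed through() overloads (not the Kafka Streams API).
final class ThroughOverloadSketch {

    // Describes the auto-generated topic an overload would ask Streams to create.
    static final class TopicPlan {
        final String name;
        final int partitions;

        TopicPlan(String name, int partitions) {
            this.name = name;
            this.partitions = partitions;
        }
    }

    // through(): no hint given, so the source topic's partition count is inherited.
    static TopicPlan through(String applicationId, int sourcePartitions) {
        return plan(applicationId, sourcePartitions, OptionalInt.empty());
    }

    // through(int numOfPartitions): the explicit hint overrides the source topic.
    static TopicPlan through(String applicationId, int sourcePartitions, int numOfPartitions) {
        return plan(applicationId, sourcePartitions, OptionalInt.of(numOfPartitions));
    }

    private static TopicPlan plan(String applicationId, int sourcePartitions, OptionalInt hint) {
        int partitions = hint.orElse(sourcePartitions);
        if (partitions <= 0) {
            throw new IllegalArgumentException("partition count must be positive");
        }
        // Assumed naming scheme, for illustration only.
        return new TopicPlan(applicationId + "-through-generated-repartition", partitions);
    }

    public static void main(String[] args) {
        System.out.println(through("my-app", 6).partitions);     // inherited from the source
        System.out.println(through("my-app", 6, 12).partitions); // taken from the hint
    }
}
```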
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Option B:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Leave Produced as it is
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Add a number-of-partitions configuration to the Grouped class
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> (as mentioned in the KIP)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) Add a new operator, KStream#repartition, for creating and
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> managing internal repartition topics
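What a dedicated repartition operator with a partition-count hint does to records can be sketched as below. This is a toy model under stated assumptions, not the KStream#repartition implementation: the class RepartitionSketch and its methods are hypothetical, keys are plain strings, and String#hashCode stands in for the partitioner (Kafka's default partitioner hashes the serialized key bytes with murmur2). The point it illustrates is that records are redistributed by key over a chosen number of partitions without any stateful operation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of repartitioning by key (not the Kafka Streams API).
final class RepartitionSketch {

    // Stand-in partitioner: maps a key to a partition in [0, numPartitions).
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Distributes keys over numPartitions buckets; equal keys always share a bucket.
    static List<List<String>> repartition(List<String> keys, int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        for (String key : keys) {
            partitions.get(partitionFor(key, numPartitions)).add(key);
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>();
        keys.add("user-1");
        keys.add("user-2");
        keys.add("user-1");
        // Print how the three records are spread over four partitions.
        System.out.println(repartition(keys, 4));
    }
}
```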
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> P.S. I’m sorry if all of this was already discussed on the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> mailing list, but I kinda got lost in all the threads that were
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> about this KIP :(
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 1, 2019, at 9:56 AM, Levani Kokhreidze <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I would like to resurrect the discussion around KIP-221. Going
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> through the discussion thread, there seems to be agreement
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> about the usefulness of this feature.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regarding the implementation, as far as I understand, the best
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> solution seems to me to be the following:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Add two method overloads to the KStream#through method
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> (essentially making the topic name optional)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Enhance the Produced class with a numOfPartitions
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> configuration field.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Those two changes will allow DSL users to control parallelism
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and trigger a repartition without doing stateful operations.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I will update the KIP with interface changes around
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> KStream#through if these changes sound sensible.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani