[GitHub] [kafka] bbejeck commented on issue #8504: KAFKA-9298: reuse mapped stream error in joins

2020-04-23 Thread GitBox


bbejeck commented on issue #8504:
URL: https://github.com/apache/kafka/pull/8504#issuecomment-618450970


   @mjsax thanks for the review!



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [kafka] bbejeck commented on issue #8504: KAFKA-9298: reuse mapped stream error in joins

2020-04-23 Thread GitBox


bbejeck commented on issue #8504:
URL: https://github.com/apache/kafka/pull/8504#issuecomment-618447861


   >Why that? Because such a topology would hit the bug, it could never be 
deployed, and thus nobody can actually run such a topology? In fact, shouldn't 
be "burn" and index even if a name is provided (IIRC, we do this for some 
cases)?
   
   Yes in some cases we increment the index when users provide names.  But 
right now we don't increment the counter at all when creating repartition 
topics as we reuse the name as is.  My main point is that if we started to 
generate new names for repartition topics we'd create topology compatibility 
issues as the newly generated name would bump the count for all downstream 
nodes.  Right now I'm leaning towards going with the solution you presented in 
point one in https://github.com/apache/kafka/pull/8504#discussion_r413380852 
   
   >I agree thought, that merging repartition topics (as proposed in (1)) 
should be done if possible (it's a historic artifact that we did not merge them 
in the past and IMHO we should not make the same mistake again?).
   
   But by doing so we are "leaking" optimization logic as you pointed out 
above.  I'm leaning towards building the topology "as is", meaning create two 
repartition topics if that's what is required.  But I don't have a strong 
opinion and I would be fine with keeping the current solution in this PR.
   
   >For (2), it's a tricky question because the different names are used for 
different stores and changelog topics (ie, main purpose?) -- it seems to be a 
"nasty side effect" if we would end up with two repartition topics for this 
case? Of course, given the new repartition() operator, a user can work around 
it by using it after map() and before calling join(). Just brainstorming here 
what the impact could be and what tradeoff we want to pick.
   
   I'm not sure I follow here the "nasty side effect" comment.
   If a user does `streamA.join(streamB, ..., StreamJoined.name("foo")` and 
`streamA.join(streamC, ..., StreamJoined.name("bar")`  then we should create 
two repartiton topics as that's what the user is expecting.  If they elect to 
use optimization then removing redundant repartition topics is expected 
behavior.  I think this also goes back to your original comment about the 
leaking of optimization details.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [kafka] bbejeck commented on issue #8504: KAFKA-9298: reuse mapped stream error in joins

2020-04-20 Thread GitBox


bbejeck commented on issue #8504:
URL: https://github.com/apache/kafka/pull/8504#issuecomment-616620722


   ping @mjsax and @vvcephei, I know you guys are busy, but do you mind having 
a look?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org