Hi Jan,

Thanks for the reply.

It sounds like your larger point is that if we provide a building block
instead of the whole operation, then it's not too hard for users to
implement the whole operation, and maybe the building block is
independently useful.

This is a very fair point. In fact, it's not mutually exclusive with the
current plan, in that we can always add the "building block" version in
addition to, rather than instead of, the full operation. It may well turn
out to be a mistake, but I still prefer to begin by introducing the fully
encapsulated operation and subsequently consider adding the "building
block" version if it turns out that the encapsulated version is
insufficient.

IMHO, one of Streams' strengths over other processing frameworks
is its simple API, so simplicity as a design goal seems to suggest that:
> a.tomanyJoin(B)
is preferable to
> a.map(retain(key and FK)).tomanyJoin(B).groupBy(a.key()).join(A)
at least to start with.
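
Just to make the boilerplate point concrete, here's a rough sketch of what
even the first step of the hand-rolled pipeline ("retain(key and FK)") costs
the user in today's DSL. The topic names and the pipe-delimited value format
are made up purely for illustration, and tomanyJoin itself doesn't exist yet,
so steps 2-4 stay as comments:

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KTable;

public class RetainKeyAndFkSketch {

    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // A: orders keyed by orderId; the value is "customerId|rest-of-order",
        // purely for the sake of this sketch.
        KTable<String, String> orders =
                builder.table("orders", Consumed.with(Serdes.String(), Serdes.String()));

        // Step 1 of the hand-rolled pipeline: project A down to (key, FK) so
        // that only the foreign key travels over the repartition topic.
        KTable<String, String> keyAndFkOnly =
                orders.mapValues(value -> value.split("\\|")[0]);

        // Steps 2-4 (.tomanyJoin(B), .groupBy(a.key()), .join(A)) are what the
        // user would still have to wire up by hand, versus the single
        // encapsulated call.
    }
}

Every one of those extra steps is user-visible surface area, which is the
simplicity concern above.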

To answer your question about my latter potential optimization:
no, I don't have any code to look at yet. But, yes, the implementation
would bring B into A's tasks and keep its records in a state store for
joining.
Thanks for that reference, it does indeed sound similar to what
MapJoin does in Hive.
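
For intuition only: the closest existing primitive is the KStream-to-
GlobalKTable join, which already replicates the whole lookup table to every
task and joins on the stream's side without repartitioning. What I have in
mind would be roughly the KTable analogue of that data placement (the global
join today only fires on stream-side updates, so it's an analogy, not the
full semantics). Topic names and types below are made up:

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.GlobalKTable;
import org.apache.kafka.streams.kstream.KStream;

public class GlobalTableJoinSketch {

    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // "A" as a stream of orders; for simplicity the value is just the
        // customerId, i.e. the foreign key.
        KStream<String, String> orders =
                builder.stream("orders", Consumed.with(Serdes.String(), Serdes.String()));

        // "B" replicated to every task as a GlobalKTable: customerId -> customerName.
        GlobalKTable<String, String> customers =
                builder.globalTable("customers", Consumed.with(Serdes.String(), Serdes.String()));

        // The join happens on A's side: each task looks B up locally, so A is
        // never repartitioned by the foreign key.
        KStream<String, String> enriched = orders.join(
                customers,
                (orderId, customerId) -> customerId,   // extract the FK from A's record
                (customerId, customerName) -> customerId + ":" + customerName);

        enriched.to("orders-enriched");
    }
}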

Thanks again,
-John

On Mon, Jan 7, 2019 at 5:06 PM Jan Filipiak <jan.filip...@trivago.com>
wrote:

>
>
> On 02.01.2019 23:44, John Roesler wrote:
> > However, you seem to have a strong intuition that the scatter/gather
> > approach is better.
> > Is this informed by your actual applications at work? Perhaps you can
> > provide an example
> > data set and sequence of operations so we can all do the math and agree
> > with you.
> > It seems like we should have a convincing efficiency argument before
> > choosing a more
> > complicated API over a simpler one.
>
> The way I see this is simple. If we only provide the basic
> implementation of the 1:n join (repartition by FK, range scan on
> foreign-table update), then it is such a fundamental building block.
>
> I do A join B.
>
> a.map(retain(key and FK)).tomanyJoin(B).groupBy(a.key()).join(A). This
> pretty much performs all your "wire saving" optimisations. To be honest,
> I don't know whether the ContextAwareMapper() that was discussed at some
> point ever made it in. If it did, I could actually do the high-watermark
> thing: a.contextMap(retain(key, FK and offset))
> .tomanyJoin(B).aggregate(a.key(), oldest offset wins).join(A).
> I can't find the KIP, though. I guess it didn't make it.
>
> After the repartition and the range read, the abstraction just becomes
> too weak. I just showed that your implementation is my implementation
> with stuff around it.
>
> I don't know if your scatter/gather approach is in code somewhere. If
> the join is only applied after the gather phase, I really wonder where
> we get the other record from. Do you also persist the foreign table on
> the original side? Is that in code somewhere already?
>
> This would essentially bring B to each of A's tasks. The factors for
> this in my case are rather easy and dramatic. Nevertheless, it's an
> approach I would appreciate. In Hive this would be closely related to
> the concept of a MapJoin, something I wish we had in Streams. I have
> often stated that at some point we need an unbounded amount of offsets
> per topic-partition and group :D Sooooo good.
>
> Long story short: I hope you can follow my line of thought, and I hope
> you can clear up my misunderstanding of how the join is performed on the
> A side without materializing B there.
>
> I would love it if Streams got this right. The basic rule I always state
> is: do what Hive does. Done.
>
>
> >
> > Last thought:
> >> Regarding what will be observed: I consider it a plus that all events
> >> that are in the inputs have a respective output, whereas your solution
> >> might "swallow" events.
> >
> > I didn't follow this. Following Adam's example, we have two join
> results: a
> > "dead" one and
> > a "live" one. If we get the dead one first, both solutions emit it,
> > followed by the live result.
>
> There might be multiple dead ones in flight, right? But it doesn't
> really matter; I never did anything with the extra benefit I mentioned.
>
