Re: [DISCUSS] - KIP-314: KTable to GlobalKTable Bi-directional Join

Guozhang Wang Wed, 20 Jun 2018 15:45:03 -0700

Hello Adam,

Thanks for proposing the KIP. A few meta comments:

1. As Matthias mentioned, the current GlobalKTable is designed to be
read-only, and not driving any computations (btw the global store backing a
GlobalKTable should also be read-only). Behind the scenes, the global store
updating task and the regular streams task are two separate ones running
two separate processor topologies by two threads: the global store updating
task's topology is simply a source node, plus a processor node (let's call
it the update-processor) that puts to the store. If we allow the
GlobalKTable to drive the join, then we need the underlying global store's
update processor to link to the downstream processors of the normal regular
task's topology in order to pass the joined results to downstream. It means
the two topologies will be merged, and that merged topology can only be
executed as a single task, by a single thread. We need to find a way to
work around this issue before proceeding to the next steps.

2. It is not clear what you mean by "In terms of data complexity, any pattern
that requires us to rekey the data once is equivalent in terms of data
capacity requirements..". Do you mean that although we have a duplicated
state store (ModifiedEvents, in addition to the original Events, differing
only in the enhanced key), this is unavoidable anyway even if we do re-keying?
Note that in
https://cwiki.apache.org/confluence/display/KAFKA/KIP-213+Support+non-key+joining+in+KTable?preview=/74684836/74687529/Screenshot%20from%202017-11-18%2023%3A26%3A52.png
we were considering whether it is still possible to materialize each of the
joining tables only once, i.e. without a duplicated store. So I think it
is not necessarily the case that we have to duplicate the KTable's store.
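
To illustrate the topology concern in point 1 above, here is a rough
plain-Java sketch (no Kafka dependency). It models processor nodes as a graph
and counts connected groups of nodes, on the assumption that each connected
group has to run as one task. All class, method and node names are made up
for illustration; this is not the actual Streams internals.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical model: nodes are processor names, edges are parent/child links.
final class TopologyMergeSketch {
    private final Map<String, Set<String>> neighbours = new LinkedHashMap<>();

    void addNode(String name) {
        neighbours.computeIfAbsent(name, k -> new HashSet<>());
    }

    // Link a parent processor to a child; stored undirected for connectivity checks.
    void connect(String parent, String child) {
        addNode(parent);
        addNode(child);
        neighbours.get(parent).add(child);
        neighbours.get(child).add(parent);
    }

    // Number of connected groups of nodes (assumed: one task per group).
    int countTopologies() {
        Set<String> seen = new HashSet<>();
        int count = 0;
        for (String start : neighbours.keySet()) {
            if (!seen.add(start)) {
                continue; // already reached from an earlier start node
            }
            count++;
            Deque<String> stack = new ArrayDeque<>();
            stack.push(start);
            while (!stack.isEmpty()) {
                for (String next : neighbours.get(stack.pop())) {
                    if (seen.add(next)) {
                        stack.push(next);
                    }
                }
            }
        }
        return count;
    }

    // Today's layout: the global updating topology and the regular topology are disjoint.
    static TopologyMergeSketch currentDesign() {
        TopologyMergeSketch t = new TopologyMergeSketch();
        t.connect("global-source", "update-processor"); // global store updating task
        t.connect("table-source", "join-processor");    // regular streams task
        t.connect("join-processor", "sink");
        return t;
    }

    public static void main(String[] args) {
        TopologyMergeSketch t = currentDesign();
        System.out.println("before: " + t.countTopologies());
        // Letting the GlobalKTable drive the join links the update-processor downstream.
        t.connect("update-processor", "join-processor");
        System.out.println("after: " + t.countTopologies());
    }
}
```

Once the update-processor has a child in the regular topology, the two groups
collapse into one, which is the single-task, single-thread problem described
above.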

Two minor comments:

1. In the "KTable as Driver, joined on GlobalKTable join mechanism" section,
I think we still need to join the old value with the global store to form a
pair of "<new / old>" joined results, so that the resulting KTable can still
be used in a downstream aggregation operator that relies on correct
addition / subtraction logic.

2. For KTable-KTable joins we have inner / left / outer, while for
KStream-KTable / GlobalKTable joins we only have inner / left; the
reason is that for stream-table joins an outer join makes less sense. Should
we consider outer for the KTable-GlobalKTable join as well?
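
To make the "<new / old>" point in minor comment 1 concrete, here is a small
plain-Java sketch (again no Kafka dependency). It assumes an inner join of a
table value (used as the lookup key) against a read-only global map, and a
downstream sum that subtracts the old joined value and adds the new one. All
names are hypothetical, and it assumes the global store does not change
between updates.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of a KTable-driven join that emits old/new joined pairs.
final class OldNewJoinSketch {

    // An update carrying the previous and the current joined value; either may be null.
    static final class Change<V> {
        final V oldValue;
        final V newValue;

        Change(V oldValue, V newValue) {
            this.oldValue = oldValue;
            this.newValue = newValue;
        }
    }

    private final Map<String, Integer> globalStore = new HashMap<>(); // productId -> price
    private final Map<String, String> table = new HashMap<>();        // orderId -> productId
    private long total = 0;

    void putGlobal(String productId, int price) {
        globalStore.put(productId, price);
    }

    // Join one side of a change against the global store; null means no match.
    private Integer lookup(String productId) {
        return productId == null ? null : globalStore.get(productId);
    }

    // Apply a table update (null deletes the row) and join BOTH the old and the new value.
    Change<Integer> updateTable(String orderId, String newProductId) {
        String oldProductId = (newProductId == null)
                ? table.remove(orderId)
                : table.put(orderId, newProductId);
        return new Change<>(lookup(oldProductId), lookup(newProductId));
    }

    // Downstream aggregation: subtract the old joined value, then add the new one.
    void aggregate(Change<Integer> change) {
        if (change.oldValue != null) {
            total -= change.oldValue;
        }
        if (change.newValue != null) {
            total += change.newValue;
        }
    }

    long total() {
        return total;
    }

    public static void main(String[] args) {
        OldNewJoinSketch s = new OldNewJoinSketch();
        s.putGlobal("A", 10);
        s.putGlobal("B", 25);
        s.aggregate(s.updateTable("order-1", "A"));
        s.aggregate(s.updateTable("order-2", "A"));
        s.aggregate(s.updateTable("order-1", "B")); // subtracts 10, adds 25
        System.out.println(s.total()); // prints 35
    }
}
```

If only the new value were joined, the last update would add 25 without
removing the earlier 10, and the sum would drift.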

Guozhang

On Tue, Jun 19, 2018 at 10:27 AM, Adam Bellemare <adam.bellem...@gmail.com>
wrote:

> Matthias
>
> Thanks for the links. I have seen those before but I will dig deeper into
> them, especially around the CombinedKey and the flush + cache + rangescan
> functionality. I believe Jan had a PR with many of the changes in there,
> perhaps I can use some of the work that was done there to help leverage a
> similar (or identical) design.
>
> I will certainly be able to make a PoC before going to vote on this one. It
> is a larger change and I suspect that we will need to review some of the
> finer points to ensure that the design is still suitable and sufficiently
> performant. I'll post back when I have something more concrete, but in the
> meantime I welcome all other concerns and comments.
>
> Thanks
>
>
>
> On Mon, Jun 18, 2018 at 10:05 PM, Matthias J. Sax <matth...@confluent.io>
> wrote:
>
> > Adam,
> >
> > thanks a lot for the KIP. I agree that this would be a valuable feature
> > to add. It's a very complex one though. You correctly pointed out, that
> > the GlobalKTable (or global stores in general) cannot be the "driver"
> > atm and are passively updated only. This is by design. Are you familiar
> > with the KIP discussion of KIP-99?
> > (https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=67633649)
> > It would be worth revisiting it to understand the tradeoffs and design
> > decisions.
> >
> > It's unclear to me, what the impact will be if we want to change the
> > current design. Even if no GlobalKTable is used, it might have impact on
> > performance and for sure on code complexity. Overall, it seems that a
> > POC might be required before we can consider adding this (with the
> > danger, that it does not get accepted in the end).
> >
> > Are you aware of KIP-213:
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-213+Support+non-key+joining+in+KTable
> >
> > It suggests adding non-key joins, and a lot of issues around how to
> > implement this were discussed already. As a KTable-GlobalKTable join is a
> > non-key join, too, it seems that those discussions apply to your KIP too.
> >
> > Hope this helps to make the next steps.
> >
> >
> > -Matthias
> >
> >
> > On 6/18/18 1:15 PM, Adam Bellemare wrote:
> > > Hi All
> > >
> > > I created KIP-314 and I would like to initiate a discussion on it.
> > >
> > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-314%3A+KTable+to+GlobalKTable+Bi-directional+Join
> > >
> > > The primary goal of this KIP is to improve the way that Kafka can deal
> > > with relational data at scale. This KIP would alter the way that
> > > GlobalKTables can be used in relation to KTables. I believe that this
> > > would be a very useful change but I need some eyes on the technical
> > > aspects to validate or refute the strategy.
> > >
> > > Thanks
> > >
> > > Adam Bellemare
> > >
> >
> >
>

-- 
-- Guozhang
