Re: [DISCUSS] KIP-372: Naming Joins and Grouping

John Roesler Thu, 13 Sep 2018 15:49:48 -0700

Hey all,

I think it's slightly out of scope for this KIP, but I'm not sure it's
right to add a name to ValueJoiner or KeyValueMapper.
Both of those are "functional interfaces", that is, they are basically
named functions.


It seems like we should preserve this property both to provide a clean
separation between the operation *logic* and the operator *config*.

It's a good point that `Joined` doesn't appear in the KTable join. However,
KTable join does have the variant that accepts "Materialized".
This makes it similar to the grouping operators that currently accept
"Serialized", namely the operator is already configurable, but only in
a narrow way (we can only configure the materialization of the table or the
serialization of the grouping).

Consider as an alternative the KStream.join operation that takes a config
object called "Joined". This implies no more than that we are
configuring the join operation. If we are deciding to allow adding a name
to the operation, it's natural to add it to the operation's config.

Conversely, we also want to name the KTable.join operation, not the
materialization itself (which already has a name with separate semantics).

To me, this suggests that, just like we propose to subsume the Serialized
config for grouping into a more general config called Grouped, we should
also want to do something similar and replace `KTable#join(KTable<>,
ValueJoiner<>, Materialized<>)` with one taking a more general
configuration, maybe:
`KTable#join(KTable<>, ValueJoiner<>, TableJoined<>)`?

Does this make sense?

Thanks,
-John

On Thu, Sep 13, 2018 at 3:45 PM Guozhang Wang <[email protected]> wrote:

> Just to clarify on 2): currently KIP-307 do not have proposed APIs for
> `groupBy/groupByKey` naming schemes, and for joins its current proposal is
> to extend ValueJoiner with Named and hence this part is what I meant to
> have "overlaps".
>
> Thinking about it a bit more, since Joined is only used for S-S and S-T
> joins but not T-T joins, having the naming schemes on Joined would not be
> sufficient, and extending ValueJoiner would indeed be a good choice.
>
> As for groupBy, since it is using KeyValueMapper which is supposed to be
> extended with Named in KIP-307, it does not require extending to processor
> nodes as well.
>
>
> Given this, I'm fine with limiting the scope to only repartition topics.
>
>
> Guozhang
>
> On Wed, Sep 12, 2018 at 10:22 PM, Matthias J. Sax <[email protected]>
> wrote:
>
> > Follow up comments:
> >
> > 1) We should either use `[app-id]-this|other-[join-name]-repartition` or
> > `app-id]-[join-name]-left|right-repartition` but we should not change
> > the pattern depending if the user specifies a name of not. I am fine
> > with both patterns---just want to make sure with stick with one.
> >
> > 2) I didn't see why we would need to do this in this KIP. KIP-307 seems
> > to be orthogonal, and thus KIP-372 should not change any processor
> > names, but KIP-307 should define a holistic strategy for all processor.
> > Otherwise, we might up with different strategies or revert what we
> > decide in this KIP if it's not compatible with KIP-307.
> >
> >
> > -Matthias
> >
> >
> > On 9/12/18 6:28 PM, Guozhang Wang wrote:
> > > Hello Bill,
> > >
> > > I made a pass over your proposal and here are some questions:
> > >
> > > 1. For Joined names, the current proposal is to define the repartition
> > > topic names as
> > >
> > > * [app-id]-this-[join-name]-repartition
> > >
> > > * [app-id]-other-[join-name]-repartition
> > >
> > >
> > > And if [join-name] not specified, stay the same, which is:
> > >
> > > * [previous-processor-name]-repartition for both Stream-Stream (S-S)
> > join
> > > and S-T join
> > >
> > > I think it is more natural to rename it to
> > >
> > > * [app-id]-[join-name]-left-repartition
> > >
> > > * [app-id]-[join-name]-right-repartition
> > >
> > >
> > > 2. I'd suggest to use the name to also define the corresponding
> processor
> > > names accordingly, in addition to the repartition topic names. Note
> that
> > > for joins, this may be overlapping with KIP-307
> > > <https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > 307%3A+Allow+to+define+custom+processor+names+with+KStreams+DSL>
> > > as
> > > it also have proposals for defining processor names for join operators
> as
> > > well.
> > >
> > > 3. Could you also specify how this would affect the optimization for
> > > merging multiple repartition topics?
> > >
> > > 4. In the "Compatibility, Deprecation, and Migration Plan" section,
> could
> > > you also mention the following scenarios, if any of the upgrade path
> > would
> > > be changed:
> > >
> > >  a) changing user DSL code: under which scenarios users can now do a
> > > rolling bounce instead of resetting applications.
> > >
> > >  b) upgrading from older version to new version, with all the names
> > > specified, and with optimization turned on. E.g. say we have the code
> > > written in 2.1 with all names specified, and now upgrading to 2.2 with
> > new
> > > optimizations that may potentially change the repartition topics. Is
> that
> > > always safe to do?
> > >
> > >
> > >
> > > Guozhang
> > >
> > >
> > > On Wed, Sep 12, 2018 at 4:52 PM, Bill Bejeck <[email protected]>
> wrote:
> > >
> > >> All I'd like to start a discussion on KIP-372 for the naming of joins
> > and
> > >> grouping operations in Kafka Streams.
> > >>
> > >> The KIP page can be found here:
> > >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > >> 372%3A+Naming+Joins+and+Grouping
> > >>
> > >> I look forward to feedback and comments.
> > >>
> > >> Thanks,
> > >> Bill
> > >>
> > >
> > >
> > >
> >
> >
>
>
> --
> -- Guozhang
>

Re: [DISCUSS] KIP-372: Naming Joins and Grouping

Reply via email to