Hi Bruno,

Thanks for that idea. I hadn't considered that
option before, and it does seem like that would be
the right place to put it if we think it might be
semantically important to control on a
table-by-table basis.

I had been thinking of it less semantically and
more practically. In the context of a large
topology, or more generally, a large software
system that contains many topologies and other
event-driven systems, each no-op result becomes an
input that is destined to itself become a no-op
result, and so on, all the way through the system.
Thus, a single pointless processing result becomes
amplified into a large number of pointless
computations, cache perturbations, and network
and disk I/O operations. If you also consider
operations with fan-out implications, like
branching or foreign-key joins, the wasted
resources grow not just in proportion to the size
of the system, but in proportion to the size of
the system times the average fan-out raised to
the power of the number of fan-out operations on
the path(s) through the system.
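
To put rough numbers on that amplification, here is a
small back-of-the-envelope sketch. It assumes a single
linear path in which every stage processes each
arriving record once and then emits a fixed number of
results (1 for an ordinary operation, more for a
fan-out operation such as a branch or foreign-key
join). The class and method names are made up for
illustration and are not part of Kafka Streams.

```java
import java.util.List;

// Hypothetical model, not Kafka Streams code: counts the work triggered
// downstream by ONE no-op result when nothing filters it out.
final class NoOpAmplification {

    /**
     * @param fanOutPerStage results emitted per input record at each stage,
     *                       in path order (1 = no fan-out)
     * @return total number of record-processing steps along the path
     */
    static long wastedComputations(List<Integer> fanOutPerStage) {
        long inFlight = 1; // records arriving at the current stage
        long total = 0;
        for (int fanOut : fanOutPerStage) {
            total += inFlight;  // each arriving record is processed once
            inFlight *= fanOut; // each processed record emits fanOut results
        }
        return total;
    }
}
```

Under these assumptions, a five-stage path with two
3-way fan-outs, `wastedComputations(List.of(1, 3, 1, 3, 1))`,
comes to 17 processing steps for one pointless input,
versus 1 if the no-op were dropped at the first stage.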

In my time operating such systems, I've observed
these effects to be very real, and in fact the
system and use case don't have to be very large
before the amplification poses an existential
threat to the system as a whole.

This is the basis of my advocating for a simple
behavior change, rather than an opt-in config of
any kind. It seems like Streams should "do the
right thing" for the majority use case. My theory
(which may be wrong) is that the majority use case
is more like "relational queries" than "CEP
queries". Even if you were doing some
event-sensitive computation, wouldn't you do it
with Stream operations (where this feature is
inapplicable anyway)?

In keeping with the "practical" perspective, I
suggested the opt-out config only in the (I think
unlikely) event that filtering out pointless
updates actually harms performance. I'd also be
perfectly fine without the opt-out config. I
really think that, because of the timestamp
semantics work already underway, we're already
pre-fetching the prior result most of the time, so
there would be very little extra I/O involved in
implementing emit-on-change.
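
For illustration only, the behavior I have in mind can
be sketched roughly as follows. This is not the Kafka
Streams implementation or API: the class, the in-memory
map standing in for a state store, and the `forwarded`
list standing in for downstream are all hypothetical.
The point is just that once the prior result is already
in hand, emit-on-change is one comparison, and that a
dropped no-op must also leave the stored timestamp
untouched so the store stays consistent with what was
emitted.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Hypothetical illustration only; not the Kafka Streams API.
final class EmitOnChangeSketch {

    // A prior result together with the timestamp it was emitted with.
    private static final class Stamped {
        final String value;
        final long timestamp;

        Stamped(String value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    // Stand-in for a state store keyed by record key.
    private final Map<String, Stamped> store = new HashMap<>();

    // Stand-in for "downstream": everything that was actually emitted.
    final List<String> forwarded = new ArrayList<>();

    void process(String key, String newValue, long timestamp) {
        Stamped prior = store.get(key); // the prior result, already fetched
        if (prior != null && Objects.equals(prior.value, newValue)) {
            // No-op update: emit nothing and keep the old value AND the
            // old timestamp in the store.
            return;
        }
        store.put(key, new Stamped(newValue, timestamp));
        forwarded.add(key + "=" + newValue + "@" + timestamp);
    }

    // Timestamp currently stored for a key (assumes the key is present).
    long storedTimestamp(String key) {
        return store.get(key).timestamp;
    }
}
```

With this sketch, processing ("a", "1") at time 10 and
then again at time 20 forwards a single result, and the
stored timestamp for "a" remains 10.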

However, we should consider whether my experience
is likely to be general. Do you have some use 
case in mind for which you'd actually want some
KTable results to be emit-on-update for semantic
reasons?

Thanks,
-John


On Fri, Jan 24, 2020, at 11:02, Bruno Cadonna wrote:
> Hi Richard,
> 
> Thank you for the KIP.
> 
> I agree with John that we should focus on the interface and behavior
> change in a KIP. We can discuss the implementation later.
> 
> I am also +1 for the survey.
> 
> I had a thought about this. Couldn't we consider emit-on-change to be
> one config of suppress (like `untilWindowCloses`)? What you basically
> propose is to suppress updates if they do not change the result.
> Considering emit on change as a flavour of suppress would be more
> flexible because it would specify the behavior locally for a KTable
> instead of globally for all KTables. Additionally, specifying the
> behavior in one place instead of multiple places feels more intuitive
> and consistent to me.
> 
> Best,
> Bruno
> 
> On Fri, Jan 24, 2020 at 7:49 AM John Roesler <vvcep...@apache.org> wrote:
> >
> > Hi Richard,
> >
> > Thanks for picking this up! I know of at least one large community member
> > for which this feature is absolutely essential.
> >
> > If I understand your two options, it seems like the proposal is to implement
> > it as a behavior change regardless, and the question is whether to provide
> > an opt-out config or not.
> >
> > Given that any implementation of this feature would have some performance
> > impact under some workloads, and also that we don't know if anyone really
> > depends on emit-on-update time semantics, it seems like we should propose
> > to add an opt-out config. Can you update the KIP to mention the exact
> > config key and value(s) you'd propose?
> >
> > Just to move the discussion forward, maybe something like:
> >     emit.on := change|update
> > with the new default being "change"
> >
> > Thanks for pointing out the timestamp issue in particular. I agree that if
> > we discard the latter update as a no-op, then we also have to discard its
> > timestamp (obviously, we don't forward the timestamp update, as that's
> > the whole point, but we also can't update the timestamp in the store, as
> > the store must remain consistent with what has been emitted).
> >
> > I have to confess that I disagree with your implementation proposal, but
> > it's also not necessary to discuss implementation in the KIP. Maybe it would
> > be less controversial if you just drop that section for now, so that the KIP
> > discussion can focus on the behavior change and config.
> >
> > Just for reference, there is some research into this domain. For example,
> > see the "Report" section (3.2.3) of the SECRET paper:
> > http://people.csail.mit.edu/tatbul/publications/maxstream_vldb10.pdf
> >
> > It might help to round out the proposal if you take a brief survey of the
> > behaviors of other systems, along with pros and cons if any are reported.
> >
> > Thanks,
> > -John
> >
> >
> > On Fri, Jan 10, 2020, at 22:27, Richard Yu wrote:
> > > Hi everybody!
> > >
> > > > I'd like to propose a change that we probably should have made a long
> > > > time ago.
> > >
> > > > The key benefit of this KIP would be reduced traffic in Kafka Streams
> > > > since a lot of no-op results would no longer be sent downstream.
> > > Here is the KIP for reference.
> > >
> > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-557%3A+Add+emit+on+change+support+for+Kafka+Streams
> > >
> > > Currently, I seek to formalize our approach for this KIP first before we
> > > determine concrete API additions / configurations.
> > > > Some configs might warrant adding, while others are not necessary since
> > > adding them would only increase complexity of Kafka Streams.
> > >
> > > Cheers,
> > > Richard
> > >
>
