Re: Kafka Streams punctuate with slow-changing KTables

Elliot Crosby-McCullough Wed, 01 Feb 2017 12:25:47 -0800

Ah, good!

The first use case I have is actually to split a raw event stream into
facts and dimensions, but it looks like this is still the right solution as
interactive queries can be done against it for "find or create" semantics.


The KIP mentions that the GlobalKTable "will only be used for doing
lookups", but presumably my first pass processor can still output new
dimension entries to the topic that the table is backed by?  Again for
"find or create".

On 1 February 2017 at 19:21, Matthias J. Sax <matth...@confluent.io> wrote:

> Thanks!
>
> About your example: in the upcoming 0.10.2 release we will have a
> KStream-GlobalKTable join that is designed for exactly your use case: to
> enrich a "fact stream" with dimension tables.
>
> If you use this feature, "stream time" will not depend on the
> GlobalKTables (but only on the "main topology") and thus the current
> punctuate() issues should be resolved for this case.
>
> cf.
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-99%3A+Add+Global+Tables+to+Kafka+Streams
>
>
>
> -Matthias
>
> On 2/1/17 10:31 AM, Elliot Crosby-McCullough wrote:
> > Yes it does make sense, thank you, that's what I thought the behaviour
> > was.
> >
> > I think having the option of triggering at least every n real-time
> > seconds (or whatever period) would be very useful, as I can see a number
> > of situations where updates to some tables might be very infrequent
> > indeed.
> >
> > To provide a concrete example, this KTable will be holding the dimensions
> > of a star schema while the KStream will be holding the facts.  If the
> > KTable (or a partition thereof) is limited to one dimension type (i.e.
> > one-to-one with the real star schema tables) then in some cases updates
> > will be hours or days apart.  The KTable representing the `date`
> > dimension will not update faster than once per day.
> >
> > Likewise, using Kafka as a changelog for a table like Rails'
> > `schema_migrations` could easily go weeks without an update.
> >
> > Until the decision is made regarding the timing, would it be best to
> > ignore `punctuate` entirely and trigger everything message by message
> > via `process`?
> >
> > On 1 February 2017 at 17:43, Matthias J. Sax <matth...@confluent.io> wrote:
> >
> >> One thing to add:
> >>
> >> There are plans/ideas to change punctuate() semantics to "system time"
> >> instead of "stream time". Would this be helpful for your use case?
> >>
> >>
> >> -Matthias
> >>
> >> On 2/1/17 9:41 AM, Matthias J. Sax wrote:
> >>> Yes and no.
> >>>
> >>> It does not depend on the number of tuples but on the timestamps of the
> >>> tuples.
> >>>
> >>> I would assume that records in the high volume stream have timestamps
> >>> that are only a few milliseconds from each other, while for the low
> >>> volume KTable, records have timestamp differences that are much bigger
> >>> (maybe seconds).
> >>>
> >>> Thus, even if you schedule a punctuation every 30 seconds, it will get
> >>> triggered as expected. As you get KTable input on the order of seconds,
> >>> each record advances KTable time in larger steps -- thus the KTable
> >>> always "catches up".
> >>>
> >>> Only in the (real-time) case that a single partition does not make
> >>> progress, because no new data gets appended for longer than your
> >>> punctuation interval, might some calls to punctuate() not fire.
> >>>
> >>> Let's say the KTable does not get an update for 5 minutes; then, with
> >>> the 30-second interval, you would miss 9 calls to punctuate() and get
> >>> only a single call after the KTable update. (Of course, only if all
> >>> partitions advance time accordingly.)
> >>>
> >>>
> >>> Does this make sense?
> >>>
> >>>
> >>> -Matthias
> >>>
> >>> On 2/1/17 7:37 AM, Elliot Crosby-McCullough wrote:
> >>>> Hi there,
> >>>>
> >>>> I've been reading through the Kafka Streams documentation and there
> >>>> seems to be a tricky limitation that I'd like to make sure I've
> >>>> understood correctly.
> >>>>
> >>>> The docs[1] talk about the `punctuate` callback being based on stream
> >>>> time and that all incoming partitions of all incoming topics must have
> >>>> progressed through the minimum time interval for `punctuate` to be
> >>>> called.
> >>>>
> >>>> This seems like a problem for situations where you have one very fast
> >>>> and one very slow stream being processed together, for example joining
> >>>> a fast-moving KStream to a slow-changing KTable.
> >>>>
> >>>> Have I misunderstood something or is this relatively common use case
> >>>> not supported with the `punctuate` callback?
> >>>>
> >>>> Many thanks,
> >>>> Elliot
> >>>>
> >>>> [1]
> >>>> http://docs.confluent.io/3.1.2/streams/developer-guide.html#defining-a-stream-processor
> >>>> (see the "Attention" box)
> >>>>
> >>>
> >>
> >>
> >
>
>
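
The arithmetic in the quoted 5-minute example can be checked with a small
toy model. This is only a sketch of the behaviour as described in the thread,
not Kafka Streams code: the class name, the `simulate` method, and the
one-second record spacing on the fast stream are all assumptions made for
illustration. It treats stream time as the minimum timestamp across the two
inputs and fires at most one punctuation each time stream time crosses the
next scheduled point.

```java
public class StreamTimePunctuationSketch {

    /**
     * Hypothetical toy model (not the Kafka Streams implementation).
     * The fast stream gets a record every second; the slow table gets a
     * single update after quietMs. Returns
     * {firedWhileQuiet, firedAfterUpdate, missed}.
     */
    static long[] simulate(long intervalMs, long quietMs) {
        long streamTs = 0L;            // timestamp of the fast KStream input
        long tableTs = 0L;             // timestamp of the slow KTable input
        long nextPunctuation = intervalMs;
        long firedWhileQuiet = 0L;
        long firedAfterUpdate = 0L;
        long missed = 0L;

        for (long t = 1_000L; t <= quietMs; t += 1_000L) {
            streamTs = t;
            boolean tableUpdated = (t == quietMs);
            if (tableUpdated) {
                tableTs = t;
            }

            // Stream time only advances as far as the slowest input.
            long streamTime = Math.min(streamTs, tableTs);

            if (streamTime >= nextPunctuation) {
                // Number of interval boundaries crossed in this one jump.
                long due = (streamTime - nextPunctuation) / intervalMs + 1;
                missed += due - 1;     // all but one are skipped
                if (tableUpdated) {
                    firedAfterUpdate++;
                } else {
                    firedWhileQuiet++;
                }
                nextPunctuation = streamTime + intervalMs;
            }
        }
        return new long[] {firedWhileQuiet, firedAfterUpdate, missed};
    }

    public static void main(String[] args) {
        long[] r = simulate(30_000L, 300_000L);
        System.out.println("fired while quiet=" + r[0]
                + ", fired after update=" + r[1]
                + ", missed=" + r[2]);
    }
}
```

With a 30-second interval and a 5-minute quiet period on the table, the model
reports no punctuation while the table is silent, one call once the table
update arrives, and 9 skipped calls, which matches the numbers in the quoted
message.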
