Re: Kafka Streams punctuate with slow-changing KTables

Matthias J. Sax Wed, 01 Feb 2017 14:41:13 -0800

I am not sure if I can follow... what do you mean by "find or create"
semantics?


What do you mean by "my first pass processor"?


-Matthias

On 2/1/17 12:24 PM, Elliot Crosby-McCullough wrote:
> Ah, bueno!
> 
> The first use case I have is actually to split a raw event stream into
> facts and dimensions, but it looks like this is still the right solution as
> interactive queries can be done against it for "find or create" semantics.
> 
> The KIP mentions that the GlobalKTable "will only be used for doing
> lookups", but presumably my first pass processor can still output new
> dimension entries to the topic that the table is backed by?  Again for
> "find or create".
> 
> On 1 February 2017 at 19:21, Matthias J. Sax <matth...@confluent.io> wrote:
> 
>> Thanks!
>>
>> About your example: in the upcoming 0.10.2 release we will have a
>> KStream-GlobalKTable join that is designed for exactly your use case:
>> enriching a "fact stream" with dimension tables.
>>
>> If you use this feature, "stream time" will not depend on the
>> GlobalKTables (but only on the "main topology") and thus the current
>> punctuate() issues should be resolved for this case.
>>
>> cf.
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-99%3A+Add+Global+Tables+to+Kafka+Streams
>>
>>
>>
>> -Matthias
>>
>> On 2/1/17 10:31 AM, Elliot Crosby-McCullough wrote:
>>> Yes it does make sense, thank you, that's what I thought the behaviour
>>> was.
>>>
>>> I think having the option of triggering at least every n real-time
>>> seconds (or whatever period) would be very useful, as I can see a number
>>> of situations where updates to some tables might be very infrequent
>>> indeed.
>>>
>>> To provide a concrete example, this KTable will be holding the dimensions
>>> of a star schema while the KStream will be holding the facts.  If the
>>> KTable (or a partition thereof) is limited to one dimension type (i.e.
>>> one-to-one with the real star schema tables) then in some cases updates
>>> will be hours or days apart.  The KTable representing the `date`
>>> dimension will not update faster than once per day.
>>>
>>> Likewise, using Kafka as a changelog for a table like Rails'
>>> `schema_migrations` could easily go weeks without an update.
>>>
>>> Until the decision is made regarding the timing, would it be best to
>>> ignore `punctuate` entirely and trigger everything message by message via
>>> `process`?
>>>
>>> On 1 February 2017 at 17:43, Matthias J. Sax <matth...@confluent.io> wrote:
>>>
>>>> One thing to add:
>>>>
>>>> There are plans/ideas to change punctuate() semantics to "system time"
>>>> instead of "stream time". Would this be helpful for your use case?
>>>>
>>>>
>>>> -Matthias
>>>>
>>>> On 2/1/17 9:41 AM, Matthias J. Sax wrote:
>>>>> Yes and no.
>>>>>
>>>>> It does not depend on the number of tuples but on the timestamps of the
>>>>> tuples.
>>>>>
>>>>> I would assume that records in the high-volume stream have timestamps
>>>>> that are only a few milliseconds apart, while for the low-volume
>>>>> KTable, records have timestamp differences that are much bigger
>>>>> (maybe seconds).
>>>>>
>>>>> Thus, even if you schedule a punctuation every 30 seconds, it will get
>>>>> triggered as expected. Since you get KTable input on a per-second
>>>>> basis, each record advances KTable time in larger steps, so the KTable
>>>>> always "catches up".
>>>>>
>>>>> Only in the (real-time) case where a single partition does not make
>>>>> progress, because no new data gets appended for longer than your
>>>>> punctuation interval, might some calls to punctuate not fire.
>>>>>
>>>>> Let's say the KTable does not get an update for 5 minutes; then, with
>>>>> the 30-second interval above, you would miss 9 calls to punctuate() and
>>>>> get only a single call after the KTable update. (Of course, only if all
>>>>> partitions advance time accordingly.)
>>>>>
>>>>>
>>>>> Does this make sense?
>>>>>
>>>>>
>>>>> -Matthias
>>>>>
>>>>> On 2/1/17 7:37 AM, Elliot Crosby-McCullough wrote:
>>>>>> Hi there,
>>>>>>
>>>>>> I've been reading through the Kafka Streams documentation and there
>>>>>> seems to be a tricky limitation that I'd like to make sure I've
>>>>>> understood correctly.
>>>>>>
>>>>>> The docs[1] talk about the `punctuate` callback being based on stream
>>>>>> time and that all incoming partitions of all incoming topics must have
>>>>>> progressed through the minimum time interval for `punctuate` to be
>>>>>> called.
>>>>>>
>>>>>> This seems like a problem for situations where you have one very fast
>>>>>> and one very slow stream being processed together, for example joining
>>>>>> a fast-moving KStream to a slow-changing KTable.
>>>>>>
>>>>>> Have I misunderstood something or is this relatively common use case
>>>>>> not supported with the `punctuate` callback?
>>>>>>
>>>>>> Many thanks,
>>>>>> Elliot
>>>>>>
>>>>>> [1]
>>>>>> http://docs.confluent.io/3.1.2/streams/developer-guide.html#defining-a-stream-processor
>>>>>> (see the "Attention" box)
>>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>>
>
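
The five-minute example in the quoted 9:41 AM message can be checked with a
small simulation. This is only a sketch under stated assumptions, not Kafka
Streams code: the class and method names are hypothetical, stream time is
modelled as the minimum of the latest timestamps seen on each input partition,
and a punctuation that falls due fires once without replaying the skipped
intervals, as that message describes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of stream-time punctuation; none of these names are Kafka APIs.
final class StreamTimePunctuationSketch {

    // Assumption: stream time is the smallest "latest timestamp" over all partitions.
    static long streamTime(long[] partitionTimes) {
        long min = Long.MAX_VALUE;
        for (long t : partitionTimes) {
            min = Math.min(min, t);
        }
        return min;
    }

    // Each event is {partitionIndex, timestampMs}. Returns the stream times at
    // which the modelled punctuate() fired.
    static List<Long> simulate(int partitions, long intervalMs, List<long[]> events) {
        long[] partitionTimes = new long[partitions]; // every partition starts at 0
        long nextDue = intervalMs;
        List<Long> fired = new ArrayList<>();
        for (long[] event : events) {
            int partition = (int) event[0];
            partitionTimes[partition] = Math.max(partitionTimes[partition], event[1]);
            long now = streamTime(partitionTimes);
            if (now >= nextDue) {
                fired.add(now);
                // Assumption: skipped intervals are not replayed.
                nextDue = now + intervalMs;
            }
        }
        return fired;
    }

    // Partition 0: fast KStream, one record per second for 5 minutes.
    // Partition 1: slow KTable, a single update after 5 minutes.
    static List<Long> fiveMinuteGap() {
        List<long[]> events = new ArrayList<>();
        for (long t = 1_000; t <= 300_000; t += 1_000) {
            events.add(new long[] {0, t});
        }
        events.add(new long[] {1, 300_000});
        return simulate(2, 30_000, events);
    }

    public static void main(String[] args) {
        List<Long> fired = fiveMinuteGap();
        long expected = 300_000 / 30_000; // punctuations due if time advanced smoothly
        System.out.println("fired=" + fired.size()
                + " missed=" + (expected - fired.size()));
    }
}
```

With a 30-second interval and a table that is silent for 300 seconds, the
model fires once and misses nine punctuations, matching the figures in the
thread. The fast partition alone never advances stream time in this model,
which is why nothing fires until the table update arrives.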
