[ 
https://issues.apache.org/jira/browse/KAFKA-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16510911#comment-16510911
 ] 

Ben Stopford edited comment on KAFKA-4113 at 6/13/18 4:49 PM:
--------------------------------------------------------------

Whilst I like the 'time-aligned' approach to loading KTables very much, it 
definitely catches people out. I think this is compounded by the fact that 
GKTables don't behave like this (they bootstrap themselves on startup rather 
than being time aligned).

Different use cases actually better suit one or the other (as noted above). So 
for example, if you're joining Orders to Customers and doing reprocessing you 
might want the 'as at' version of the customer (say with an old email address) 
or the latest version of the customer (with their most recent email).
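
To make the two join semantics concrete, here is a minimal sketch in plain Java. It is not Kafka Streams code: the class `CustomerEmailHistory` and its methods are hypothetical names invented for illustration, and the timestamps and email addresses are made-up sample data.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration, not Kafka Streams API: a version history for one
// customer, used to contrast "as at" lookups with "latest" lookups.
class CustomerEmailHistory {
    // Email versions keyed by the event time of each update.
    private final TreeMap<Long, String> versions = new TreeMap<>();

    void update(long timestamp, String email) {
        versions.put(timestamp, email);
    }

    // Time-aligned ("as at") lookup: the newest version whose timestamp is
    // not after the order's timestamp, or null if no version existed yet.
    String asAt(long orderTimestamp) {
        Map.Entry<Long, String> entry = versions.floorEntry(orderTimestamp);
        return entry == null ? null : entry.getValue();
    }

    // Preloaded lookup: always the most recent version, whatever the order's time.
    String latest() {
        return versions.isEmpty() ? null : versions.lastEntry().getValue();
    }

    public static void main(String[] args) {
        CustomerEmailHistory history = new CustomerEmailHistory();
        history.update(100L, "old@example.com");
        history.update(500L, "new@example.com");

        long orderTime = 200L; // an old order that is being reprocessed
        System.out.println("as at:  " + history.asAt(orderTime));
        System.out.println("latest: " + history.latest());
    }
}
```

With this sample data, the order at time 200 joins to the old address under the 'as at' lookup and to the new address under the 'latest' lookup, which is the difference a user would need to be able to choose between.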

So I think KStreams should support both (a) preloaded and (b) event-time-aligned
loading, ideally in both types of table, letting the user define the behaviour.


was (Author: benstopford):
Whilst I like the 'time-aligned' approach to loading KTables very much, it 
definitely catches people out. I think this is compounded by the fact that 
GKTables don't behave like this (they bootstrap themselves on startup rather 
than being time aligned).

Different use cases actually better suit one or the other (as noted above). So 
for example, if you're joining Orders to Customers and doing reprocessing you 
might want the 'as at' version of the customer (say with an old email address) 
or the latest version of the customer (with their most recent email).

So I think KStreams should support both (a) preloaded and (b) event-time-aligned
loading, ideally in both types of table, letting the user define the behaviour.

I've tried to explain the background to this in a bit more detail 
[here|http://www.benstopford.com/2018/06/13/things-can-trip-building-streams-apps/].
 

> Allow KTable bootstrap
> ----------------------
>
>                 Key: KAFKA-4113
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4113
>             Project: Kafka
>          Issue Type: New Feature
>          Components: streams
>            Reporter: Matthias J. Sax
>            Assignee: Guozhang Wang
>            Priority: Major
>
> On the mailing list, there are multiple requests about the possibility to
> "fully populate" a KTable before actual stream processing starts.
> Even though it is somewhat difficult to define when the initial populating
> phase should end, there are multiple possibilities:
> The main idea is that there is a rarely updated topic that contains the
> data. Only after this topic has been read completely and the KTable is ready
> should the application start processing. This implies that, on startup,
> the current partition end offsets must be fetched and stored, and after the
> KTable has been populated up to those offsets, stream processing can start.
> Other discussed ideas are:
> 1) an initial fixed time period for populating
> (it might be hard for a user to estimate the correct value)
> 2) an "idle" period, i.e., if a KTable receives no update for a certain
> time, we consider it populated
> 3) a timestamp cut-off point, i.e., all records with an older timestamp
> belong to the initial populating phase
> The API change is not decided yet, and the API design is part of this JIRA.
> One suggestion (for option (3)) was:
> {noformat}
> // Populate the table without reading any other topics until a record
> // with timestamp 1000 is seen.
> KTable table = builder.table("topic", 1000);
> {noformat}
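
For the main idea in the description (capture the end offsets at startup, then hold back stream processing until the table has caught up to them), the completion check could be sketched as below. This is plain Java, not Kafka Streams code: `BootstrapTracker` and its methods are hypothetical names, and the sketch assumes offsets start at 0 with no records removed from the log. In a real application the captured offsets would come from the broker, for example via the consumer's `endOffsets` call.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Kafka Streams: tracks whether a table has
// been populated up to the end offsets captured at startup.
class BootstrapTracker {
    // partition -> end offset captured at startup (offset of the next record to be written)
    private final Map<Integer, Long> targetOffsets;
    // partition -> next offset to read, i.e. one past the last applied record
    private final Map<Integer, Long> positions = new HashMap<>();

    BootstrapTracker(Map<Integer, Long> endOffsetsAtStartup) {
        this.targetOffsets = new HashMap<>(endOffsetsAtStartup);
    }

    // Call after the record at `offset` of `partition` has been applied to the table.
    void recordApplied(int partition, long offset) {
        positions.merge(partition, offset + 1, Math::max);
    }

    // True once every partition has reached its captured end offset.
    // Simplifying assumption: offsets start at 0, so a partition that was
    // empty at startup (end offset 0) counts as done immediately.
    boolean isBootstrapped() {
        for (Map.Entry<Integer, Long> target : targetOffsets.entrySet()) {
            long position = positions.getOrDefault(target.getKey(), 0L);
            if (position < target.getValue()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<Integer, Long> endOffsets = new HashMap<>();
        endOffsets.put(0, 2L); // partition 0 held two records at startup
        endOffsets.put(1, 0L); // partition 1 was empty at startup

        BootstrapTracker tracker = new BootstrapTracker(endOffsets);
        tracker.recordApplied(0, 0L);
        System.out.println("after first record:  " + tracker.isBootstrapped());
        tracker.recordApplied(0, 1L);
        System.out.println("after second record: " + tracker.isBootstrapped());
    }
}
```

Records written after startup do not move the target, so the bootstrap phase ends even if the table topic keeps receiving occasional updates.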



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
