[ 
https://issues.apache.org/jira/browse/KAFKA-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17270739#comment-17270739
 ] 

Stuart Perks edited comment on KAFKA-4113 at 1/23/21, 7:33 PM:
---------------------------------------------------------------

[~mjsax] I have a scenario where a KTable is backed by a compacted topic. I 
convert the KTable to a stream, flatMap and rekey the records, and then join 
them on the new keys against the same KTable. With caching turned off, this 
scenario processes every message that the topic has not yet compacted away.

So it is basically a self-join on the KTable. I want to always join against the 
latest data in the KTable, so a bootstrap function would be great, but that 
feature does not seem to be happening, so I am looking for alternatives. If I 
try the custom timestamp extractor that returns 0, it does not work: the data 
is the same, so both the stream and the table would get timestamp 0. Keeping 
the normal time semantics seems to mean processing each record, even for the 
same key. 

Are there any other ideas for working around this, so that I always join with 
the latest data in a KTable when I am already driving the join from the same 
KTable?

Using a different timestamp extractor for the KTable than for 
KTable.toStream seems unlikely to be possible. 

was (Author: perks):
[~mjsax] I have a scenario where I have a KTable which is compacted topic, 
which I convert to a stream and then stream the data, flatmap it and rekey and 
join with other data on the new keys with the same KTable. With cache off this 
scenario will process each message if the topic has not compacted them.

So basically a self join on the KTable. I want to always use the latest data on 
the KTable so a bootstrap function would be great does not seem to be 
happening. If I attempt the 0 custom timestamp extractor this does not work as 
data is the same so both stream and table would be 0. Keeping the normal time 
semantic it seems of processing each record even if the same key. 

Are there any other ideas of ways around this to always join with the latest 
data on a KTable when i am already driving the join from the same KTable.

Distinguish the timestamp extractor differently between the KTable and the 
KTable.toStream seems unlikely. 

> Allow KTable bootstrap
> ----------------------
>
>                 Key: KAFKA-4113
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4113
>             Project: Kafka
>          Issue Type: New Feature
>          Components: streams
>            Reporter: Matthias J. Sax
>            Priority: Major
>
> On the mailing list, there are multiple requests about the possibility to 
> "fully populate" a KTable before actual stream processing starts.
> Even if it is somewhat difficult to define when the initial populating phase 
> should end, there are multiple possibilities:
> The main idea is that there is a rarely updated topic that contains the 
> data. Only after this topic has been read completely and the KTable is ready 
> should the application start processing. This implies that, on startup, 
> the current partition sizes (end offsets) must be fetched and stored, and 
> after the KTable has been populated up to those offsets, stream processing 
> can start.
> Other discussed ideas are:
> 1) an initial fixed time period for populating
> (it might be hard for a user to estimate the correct value)
> 2) an "idle" period, i.e., if no update to a KTable is made for a certain
> time, we consider it as populated
> 3) a timestamp cut-off point, i.e., all records with an older timestamp
> belong to the initial populating phase
> The API change is not decided yet, and the API design is part of this JIRA.
> One suggestion (for option (4)) was:
> {noformat}
> // Populate the table without reading any other topics until a record with
> // timestamp 1000 is seen.
> KTable table = builder.table("topic", 1000);
> {noformat}
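
The "main idea" in the description above (snapshot the end offsets at startup, then hold back stream processing until the table has caught up to them) could be sketched roughly as follows. This is only an illustration of the bookkeeping, not an existing Kafka Streams API: the class name BootstrapGate and its methods are hypothetical, and partitions are modelled as plain integers. In a real application the snapshot would presumably come from something like KafkaConsumer#endOffsets.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: tracks whether a table topic has been read up to the
// end offsets that were captured when the application started.
final class BootstrapGate {
    // partition -> end offset at startup (offset of the next record to be written)
    private final Map<Integer, Long> targetEndOffsets;
    // partition -> next offset to read (last consumed offset + 1)
    private final Map<Integer, Long> positions = new HashMap<>();

    BootstrapGate(Map<Integer, Long> endOffsetsAtStartup) {
        this.targetEndOffsets = new HashMap<>(endOffsetsAtStartup);
    }

    // Call for every table record that has been applied to the store.
    void recordConsumed(int partition, long offset) {
        // Compacted topics may have offset gaps, so keep the highest position seen.
        positions.merge(partition, offset + 1, Math::max);
    }

    // True once every partition has reached its startup end offset.
    boolean isBootstrapped() {
        for (Map.Entry<Integer, Long> e : targetEndOffsets.entrySet()) {
            long position = positions.getOrDefault(e.getKey(), 0L);
            if (position < e.getValue()) {
                return false;
            }
        }
        return true;
    }
}
```

Under this sketch, stream-side records would be buffered or left unpolled while isBootstrapped() returns false. An empty partition (end offset 0) counts as populated immediately, and records written to the table topic after startup do not extend the bootstrap phase.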



--
This message was sent by Atlassian Jira
(v8.3.4#803005)