[ https://issues.apache.org/jira/browse/KAFKA-7497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16742733#comment-16742733 ]
Matthias J. Sax commented on KAFKA-7497:
----------------------------------------

I disagree "that you have a stream of unique events": the join condition is defined on the record key, but the record key is not a primary key for streams. For example, you can have a stream of clicks using the page-id as key. Also note that each record might join multiple times, not just once.

{quote}one side of a pair may be arbitrarily delayed or disordered, which leads to the need for memory on one or both sides of the join.{quote}

Not sure what you mean by this. If you refer to the join window, I think these are two different things. "Delay" or "disorder" seem to refer to wall-clock time, but the join is defined on event-time. Thus, the semantics are to join events that happen temporally close to each other.

This can be translated to the self-join case too: consider the clickstream example with page-id as key. It means returning all pages for which there is more than one click within the time window. I don't think this is related to similarity joins at all.

> Kafka Streams should support self-join on streams
> -------------------------------------------------
>
>                 Key: KAFKA-7497
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7497
>             Project: Kafka
>          Issue Type: New Feature
>          Components: streams
>            Reporter: Robin Moffatt
>            Priority: Major
>              Labels: needs-kip
>
> There are valid reasons to want to join a stream to itself, but Kafka Streams
> does not currently support this ({{Invalid topology: Topic foo has already
> been registered by another source.}}). To perform the join requires creating
> a second stream as a clone of the first, and then doing a join between the
> two. This is a clunky workaround and results in unnecessary duplication of
> data.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
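To make the event-time self-join semantics from the comment concrete, here is a minimal sketch in plain Python. It is not Kafka Streams code and does not use any Kafka API. The function name {{windowed_self_join}}, the sample clicks, and the 5-second window are all hypothetical. It also assumes that a record does not join with itself and that the window is symmetric and inclusive; neither is specified in the discussion.

```python
from collections import defaultdict


def windowed_self_join(records, window_ms):
    """Join every record with each other same-key record within window_ms.

    records: iterable of (key, event_time_ms, value) tuples.
    Returns a list of (key, left_value, right_value) tuples.
    Assumption: a record is never paired with itself.
    """
    # Group by record key; the key is the join condition, not a primary key,
    # so many records may share the same key.
    by_key = defaultdict(list)
    for key, event_time_ms, value in records:
        by_key[key].append((event_time_ms, value))

    joined = []
    for key, events in by_key.items():
        for i, (ts_left, left) in enumerate(events):
            for j, (ts_right, right) in enumerate(events):
                # Compare event times only; arrival order plays no role.
                if i != j and abs(ts_left - ts_right) <= window_ms:
                    joined.append((key, left, right))
    return joined


# Hypothetical clickstream keyed by page-id: (page_id, event_time_ms, click_id).
clicks = [
    ("page-1", 1_000, "click-a"),
    ("page-2", 1_500, "click-b"),
    ("page-1", 3_000, "click-c"),
    ("page-1", 60_000, "click-d"),
    ("page-2", 90_000, "click-e"),
]

results = windowed_self_join(clicks, window_ms=5_000)

# Pages with more than one click inside the window.
pages_with_repeat_clicks = sorted({key for key, _, _ in results})
```

Because the match depends only on the key and the event-time difference, feeding the same records in a different arrival order produces the same set of pairs. Each matching pair appears twice, once per side, which mirrors joining two copies of the same stream.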