[ 
https://issues.apache.org/jira/browse/KAFKA-7497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16742733#comment-16742733
 ] 

Matthias J. Sax commented on KAFKA-7497:
----------------------------------------

I disagree with the claim "that you have a stream of unique events": the join 
condition is defined on the record key, but the record key is not a primary 
key for streams. For example, you can have a stream of clicks using the 
page-id as key. Also note that each record might join multiple times, not 
just once.
{quote}one side of a pair may be arbitrarily delayed or disordered, which leads 
to the need for memory on one or both sides of the join.{quote}
Not sure what you mean by this. If you refer to the join window, I think these 
are two different things. "Delay" or "disorder" seem to refer to wall-clock 
time, but the join is defined on event-time. Thus, the semantics are to join 
events that happen temporally close to each other.

This can be translated to the self-join case too: consider the clickstream 
example with page-id as key. A self-join would return all pages for which 
there is more than one click within the time window.
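
To make the clickstream case concrete, here is a rough sketch in plain Java. 
It does not use the Kafka Streams API; the class SelfJoinSketch, the Click 
type, and the method pagesWithRepeatedClicks are hypothetical names made up 
for illustration. It assumes a symmetric window, i.e. two clicks match when 
their event-times differ by at most windowMs, and it only shows the intended 
semantics, not how a real implementation would manage state.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

final class SelfJoinSketch {

    /** A click event: pageId is the record key, ts is the event-time in ms. */
    public static final class Click {
        final String pageId;
        final long ts;

        public Click(String pageId, long ts) {
            this.pageId = pageId;
            this.ts = ts;
        }
    }

    /**
     * Returns the page-ids for which at least two distinct clicks have
     * event-times that differ by at most windowMs (assumed symmetric window).
     * The position of a click in the list (arrival order) is ignored.
     */
    public static Set<String> pagesWithRepeatedClicks(List<Click> clicks, long windowMs) {
        Set<String> result = new TreeSet<>();
        for (int i = 0; i < clicks.size(); i++) {
            for (int j = i + 1; j < clicks.size(); j++) {
                Click a = clicks.get(i);
                Click b = clicks.get(j);
                // Join condition: same key and event-times within the window.
                if (a.pageId.equals(b.pageId) && Math.abs(a.ts - b.ts) <= windowMs) {
                    result.add(a.pageId);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // The list order models arrival order; the "home" clicks arrive out
        // of event-time order, which does not affect the result.
        List<Click> clicks = Arrays.asList(
                new Click("home", 4_000L),
                new Click("docs", 1_500L),
                new Click("home", 1_000L),
                new Click("docs", 9_000L));
        System.out.println(pagesWithRepeatedClicks(clicks, 5_000L));
    }
}
```

With a 5 second window, only "home" qualifies: its two clicks are 3 seconds 
apart in event-time, while the two "docs" clicks are 7.5 seconds apart. The 
key is shared by several records, so it does not act as a primary key.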

I don't think this is related to similarity joins at all.


> Kafka Streams should support self-join on streams
> -------------------------------------------------
>
>                 Key: KAFKA-7497
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7497
>             Project: Kafka
>          Issue Type: New Feature
>          Components: streams
>            Reporter: Robin Moffatt
>            Priority: Major
>              Labels: needs-kip
>
> There are valid reasons to want to join a stream to itself, but Kafka Streams 
> does not currently support this ({{Invalid topology: Topic foo has already 
> been registered by another source.}}).  To perform the join requires creating 
> a second stream as a clone of the first, and then doing a join between the 
> two. This is a clunky workaround and results in unnecessary duplication of 
> data.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
