[ 
https://issues.apache.org/jira/browse/KAFKA-7497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16743225#comment-16743225
 ] 

John Roesler commented on KAFKA-7497:
-------------------------------------

Thanks [~mjsax],

I see that the "key" field in Kafka can be set to anything. My question was 
about the semantics of a stream-stream join. I've read our javadoc, and all it 
says is that it does an "inner equi-join" restricted by the time window. I 
guess this means that, given two streams `U=<u1, u2, u3>` and `V=<v1, v2, v3>`, 
it produces at least one result pair `(ui, vj)` for each pair in the cartesian 
product of the streams such that `ui.key == vj.key` and 
`abs(ui.time - vj.time) <= window_size`. Under this definition, if we happen to 
set `V := U`, then the operation is still well defined.
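
As an illustration of that reading, here is a minimal sketch in plain Java. It is a toy model of the definition only, not Kafka Streams code: `Rec`, `join`, and the sample data are made-up names and values, and it emits exactly one result per matching pair, which is just one way to satisfy "at least one".

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the windowed inner equi-join definition; not Kafka Streams code.
final class WindowedJoinSketch {

    // Hypothetical record type: a key plus an event timestamp in milliseconds.
    static final class Rec {
        final String key;
        final long time;

        Rec(String key, long time) {
            this.key = key;
            this.time = time;
        }
    }

    // Returns every pair (ui, vj) of the cartesian product U x V with
    // ui.key == vj.key and abs(ui.time - vj.time) <= windowSize.
    static List<Rec[]> join(List<Rec> u, List<Rec> v, long windowSize) {
        List<Rec[]> out = new ArrayList<>();
        for (Rec ui : u) {
            for (Rec vj : v) {
                if (ui.key.equals(vj.key) && Math.abs(ui.time - vj.time) <= windowSize) {
                    out.add(new Rec[] {ui, vj});
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Rec> u = List.of(
                new Rec("a", 0), new Rec("a", 5), new Rec("b", 7), new Rec("a", 100));
        // Self-join: V := U. Nothing in the definition requires U and V to differ.
        for (Rec[] p : join(u, u, 10)) {
            System.out.println(p[0].key + "@" + p[0].time + " x " + p[1].key + "@" + p[1].time);
        }
    }
}
```

With this sample data and a window of 10, the self-join yields six pairs. Under the literal definition that includes each record paired with itself; whether a real implementation should emit those pairs is a separate question that this sketch does not settle.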

It sounds like supporting this case is the precise ask, since at the moment, 
choosing `V := U` throws a runtime error, even though it's not semantically 
prohibited.

It does seem like part of the scope of work should be to implement it 
efficiently, that is, to detect that both streams are actually the same at 
topology-build-time and ensure that we only need one join window store.
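
A rough sketch of why one store can be enough for a self-join, again in plain Java rather than Kafka Streams code. The class, its `process` method, and the in-memory map standing in for a join window store are all hypothetical, and window retention/expiry is left out entirely:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model: a self-join served from a single buffer instead of two stores.
final class SingleStoreSelfJoinSketch {

    // Hypothetical stand-in for one join window store: timestamps seen so far, per key.
    private final Map<String, List<Long>> store = new HashMap<>();
    private final long windowSize;

    SingleStoreSelfJoinSketch(long windowSize) {
        this.windowSize = windowSize;
    }

    // Processes one record and returns the results it triggers,
    // each formatted as "key:leftTime,rightTime".
    List<String> process(String key, long time) {
        List<Long> seen = store.computeIfAbsent(key, k -> new ArrayList<>());
        List<String> out = new ArrayList<>();
        for (long other : seen) {
            if (Math.abs(time - other) <= windowSize) {
                // The same record acts as both the left and the right input,
                // so each earlier match produces a pair in both orders.
                out.add(key + ":" + other + "," + time);
                out.add(key + ":" + time + "," + other);
            }
        }
        // Under the literal definition, a record also matches itself.
        out.add(key + ":" + time + "," + time);
        seen.add(time); // stored once, not once per join side
        return out;
    }
}
```

Each record is written to the buffer once and probed once, whereas joining a stream against a clone of itself would keep two copies of the same data.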

If I understand this scoping correctly, there's no public API change, just a 
behavior change. Also, since it's currently not possible to start a topology 
with a stream self-join, there's no deprecation or migration plan needed. 
Therefore no KIP is required.

Sound good?

> Kafka Streams should support self-join on streams
> -------------------------------------------------
>
>                 Key: KAFKA-7497
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7497
>             Project: Kafka
>          Issue Type: New Feature
>          Components: streams
>            Reporter: Robin Moffatt
>            Priority: Major
>              Labels: needs-kip
>
> There are valid reasons to want to join a stream to itself, but Kafka Streams 
> does not currently support this ({{Invalid topology: Topic foo has already 
> been registered by another source.}}). Performing the join requires creating 
> a second stream as a clone of the first, and then doing a join between the 
> two. This is a clunky workaround and results in unnecessary duplication of 
> data.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
