[GitHub] [kafka] guozhangwang commented on a change in pull request #10462: KAFKA-10847: Fix spurious results on left/outer stream-stream joins

GitBox Wed, 07 Apr 2021 11:31:32 -0700


guozhangwang commented on a change in pull request #10462:
URL: https://github.com/apache/kafka/pull/10462#discussion_r608896190




##########
File path: 
streams/src/main/java/org/apache/kafka/streams/kstream/internals/KStreamKStreamJoin.java
##########
@@ -133,34 +141,37 @@ public void process(final K key, final V1 value) {
                 return;
             }
 
-            // maxObservedStreamTime is updated and shared between left and 
right sides, so we can
-            // process a non-join record immediately if it is late
-            final long maxStreamTime = maxObservedStreamTime.updateAndGet(time 
-> Math.max(time, context().timestamp()));
-
             boolean needOuterJoin = outer;
+            boolean joinFound = false;
 
             final long inputRecordTimestamp = context().timestamp();
             final long timeFrom = Math.max(0L, inputRecordTimestamp - 
joinBeforeMs);
             final long timeTo = Math.max(0L, inputRecordTimestamp + 
joinAfterMs);
 
+            // maxObservedStreamTime is updated and shared between left and 
right join sides
+            maxObservedStreamTime.updateAndGet(t -> Math.max(t, 
inputRecordTimestamp));
+
             try (final WindowStoreIterator<V2> iter = otherWindow.fetch(key, 
timeFrom, timeTo)) {
                 while (iter.hasNext()) {
                     needOuterJoin = false;
+                    joinFound = true;
                     final KeyValue<Long, V2> otherRecord = iter.next();
                     final long otherRecordTimestamp = otherRecord.key;
+
+                    // Emit expired records before the joined record to keep 
time ordering

Review comment:
       I think we can refactor the logic here as the following:
   
   0) suppose the received record timestamp is T1, the current stream time is 
T2 >= T1; and we found one or more matching record from the other side, with 
timestamp T1' <= T2' <= T3' etc. The joined record would have the timestamp of 
T1` = max(T1, T1'), T2` = max(T1, T2'), where T1` <= T2` <= ...
   
   1) After we get all the joined records, we do not call `context.forward()` 
yet, but just cache them locally.
   
   2) We then range query the expired records store, and generate the joined 
records (and also delete the records), again we do not call `context.forward()` 
yet, but just cache them locally. 
   
   3) We merge sort on these two sorted-by-timestamp list, and then call 
`context.forward()` on the sorted join result records to emit.
   
   In this we do not need the following complex logic.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [kafka] guozhangwang commented on a change in pull request #10462: KAFKA-10847: Fix spurious results on left/outer stream-stream joins

Reply via email to