WweiL commented on code in PR #44076: URL: https://github.com/apache/spark/pull/44076#discussion_r1411160900
##########
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinExec.scala:
##########
@@ -637,18 +653,22 @@ case class StreamingSymmetricHashJoinExec(
         thisRow: UnsafeRow,
         subIter: Iterator[InternalRow])
       extends CompletionIterator[InternalRow, Iterator[InternalRow]](subIter) {
-
+      // scalastyle:off
       private val iteratorNotEmpty: Boolean = super.hasNext
 
       override def completion(): Unit = {
         val isLeftSemiWithMatch =
           joinType == LeftSemi && joinSide == LeftSide && iteratorNotEmpty
         // Add to state store only if both removal predicates do not match,
         // and the row is not matched for left side of left semi join.
+        println(s"!stateKeyWatermarkPredicateFunc(key): ${!stateKeyWatermarkPredicateFunc(key)}" +
+          s" !stateValueWatermarkPredicateFunc(thisRow): ${!stateValueWatermarkPredicateFunc(thisRow)}")
         val shouldAddToState =
           !stateKeyWatermarkPredicateFunc(key) && !stateValueWatermarkPredicateFunc(thisRow) &&
           !isLeftSemiWithMatch
         if (shouldAddToState) {
+          println(s"wei==add to state: $thisRow")

Review Comment:
   So what happens here is that, in the no-data batch, the watermark of `stateKeyWatermarkPredicateFunc` is updated to the new global watermark (8). However, the keys emitted by both parent window aggregations are [0, 5), so `stateKeyWatermarkPredicateFunc(key)` returns true, meaning that the window is not added to the join state store.

   As a result, when `SymmetricHashJoinStateManager.getJoinedRows` later tries to load the other side's stored row, it loads nothing. This is wrong, because the two [0, 5) windows should be joined here. At least one side of the window should be added to the state store, so that the other side can load it and join.

   This suggests there are some updates to the handling of multiple stateful operators that we need to consider.
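   To make the failure mode concrete, here is a minimal standalone sketch of the scenario. It is not Spark code: `Window`, `SideState`, `keyIsLate` and `process` are hypothetical names, and `keyIsLate` assumes the key predicate amounts to "window end <= watermark". The value predicate and the left-semi case are left out.

```scala
// Simplified model of the scenario described above; none of these types are Spark's.
final case class Window(start: Long, end: Long)

object JoinStateSketch {
  // Assumed stand-in for stateKeyWatermarkPredicateFunc: true means the key
  // is already behind the watermark and is treated as evictable.
  def keyIsLate(key: Window, watermark: Long): Boolean = key.end <= watermark

  // One side's join state: window key -> buffered rows.
  final class SideState {
    private val rows = scala.collection.mutable.Map.empty[Window, Vector[String]]
    def add(key: Window, row: String): Unit =
      rows.update(key, rows.getOrElse(key, Vector.empty) :+ row)
    def get(key: Window): Vector[String] = rows.getOrElse(key, Vector.empty)
  }

  // Process one input row: join it against the other side's stored rows,
  // then store it only if its key is not late (mirrors shouldAddToState).
  def process(
      key: Window,
      row: String,
      thisSide: SideState,
      otherSide: SideState,
      watermark: Long): Vector[(String, String)] = {
    val joined = otherSide.get(key).map(other => (row, other))
    if (!keyIsLate(key, watermark)) thisSide.add(key, row)
    joined
  }

  // Both aggregations emit the [0, 5) window in the same batch.
  def run(watermark: Long): Vector[(String, String)] = {
    val left = new SideState
    val right = new SideState
    val w = Window(0L, 5L)
    process(w, "left-agg", left, right, watermark) ++
      process(w, "right-agg", right, left, watermark)
  }

  def main(args: Array[String]): Unit = {
    println(s"watermark=8: ${run(8L)}") // neither side is stored, so nothing joins
    println(s"watermark=4: ${run(4L)}") // left is stored, so right finds it
  }
}
```

   With the watermark at 8, neither [0, 5) row is written to state, so the second side has nothing to load and the join output is empty. With the watermark at 4, the left row is stored and the right row joins against it.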
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org