Re: [PR] [DO-NOT-REVIEW][DRAFT] Spark 45637 multiple state test [spark]

via GitHub Thu, 30 Nov 2023 11:25:16 -0800


WweiL commented on code in PR #44076:
URL: https://github.com/apache/spark/pull/44076#discussion_r1411160900



##########
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinExec.scala:
##########
@@ -637,18 +653,22 @@ case class StreamingSymmetricHashJoinExec(
         thisRow: UnsafeRow,
         subIter: Iterator[InternalRow])
       extends CompletionIterator[InternalRow, Iterator[InternalRow]](subIter) {
-
+      // scalastyle:off
       private val iteratorNotEmpty: Boolean = super.hasNext
 
       override def completion(): Unit = {
         val isLeftSemiWithMatch =
           joinType == LeftSemi && joinSide == LeftSide && iteratorNotEmpty
         // Add to state store only if both removal predicates do not match,
         // and the row is not matched for left side of left semi join.
+        println(s"!stateKeyWatermarkPredicateFunc(key): ${!stateKeyWatermarkPredicateFunc(key)}" +
+          s" !stateValueWatermarkPredicateFunc(thisRow): ${!stateValueWatermarkPredicateFunc(thisRow)}")
         val shouldAddToState =
           !stateKeyWatermarkPredicateFunc(key) && !stateValueWatermarkPredicateFunc(thisRow) &&
           !isLeftSemiWithMatch
         if (shouldAddToState) {
+          println(s"wei==add to state: $thisRow")

Review Comment:
   What happens here: in the no-data batch, the watermark used by `stateKeyWatermarkPredicateFunc` is updated to the new global watermark (8). However, the key emitted by both parent window aggregations is the window [0, 5), so `stateKeyWatermarkPredicateFunc(key)` returns true and the window is not added to the join state store. As a result, when `SymmetricHashJoinStateManager.getJoinedRows` later tries to load the other side's stored row, it loads nothing.
   
   This is wrong, because the two [0, 5) windows should be joined here. At least one side's window should be added to the state store so that the other side can load it and join against it.
   
   It looks like the multiple stateful operators support needs some updates to handle this case.
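
To make the failure mode concrete, here is a minimal, self-contained sketch of the behavior described above. It is a toy model, not Spark's implementation: `Window`, `keyIsLate`, `processSide`, and `joinBatch` are hypothetical names, and the predicate `end <= watermark` is only an assumed stand-in for `stateKeyWatermarkPredicateFunc`.

```scala
// Toy model of the symmetric hash join state handling (NOT Spark's code).
object JoinStateSketch {
  // A window key [start, end) emitted by an upstream window aggregation.
  final case class Window(start: Long, end: Long)

  // Assumed stand-in for stateKeyWatermarkPredicateFunc: returns true when
  // the key is already at or behind the watermark.
  def keyIsLate(key: Window, watermark: Long): Boolean = key.end <= watermark

  // Joins one side's input against the other side's state, then stores only
  // the input keys that the watermark predicate does not reject.
  // Returns (joined keys, this side's new state).
  def processSide(
      input: Seq[Window],
      otherState: Set[Window],
      watermark: Long): (Seq[Window], Set[Window]) = {
    val joined = input.filter(otherState.contains)
    val stored = input.filterNot(keyIsLate(_, watermark)).toSet
    (joined, stored)
  }

  // Processes the left side first (right state is still empty), then the
  // right side against whatever the left side managed to store.
  def joinBatch(left: Seq[Window], right: Seq[Window], watermark: Long): Seq[Window] = {
    val (joinedLeft, leftState) = processSide(left, Set.empty, watermark)
    val (joinedRight, _) = processSide(right, leftState, watermark)
    joinedLeft ++ joinedRight
  }
}
```

In this model, calling `joinBatch` with the window [0, 5) on both sides and a watermark of 8 returns no rows, because the left row is rejected before the right side can look it up. With a watermark of 0 the same inputs produce one joined window.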



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

