rangadi commented on code in PR #44323: URL: https://github.com/apache/spark/pull/44323#discussion_r1581366231
########## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinHelper.scala: ##########

```diff
@@ -219,10 +222,35 @@ object StreamingSymmetricHashJoinHelper extends Logging {
       attributesWithEventWatermark = AttributeSet(otherSideInputAttributes),
       condition,
       eventTimeWatermarkForEviction)
-    val inputAttributeWithWatermark = oneSideInputAttributes.find(_.metadata.contains(delayKey))
-    val expr = watermarkExpression(inputAttributeWithWatermark, stateValueWatermark)
-    expr.map(JoinStateValueWatermarkPredicate.apply _)
+    // For example, if the condition is of the form:
+    //   left_time > right_time + INTERVAL 30 MINUTES
+    // Then this extracts left_time and right_time.
+    val attributesInCondition = AttributeSet(
+      condition.get.collect { case a: AttributeReference => a }
+    )
+
+    // Construct an AttributeSet so that we can perform equality between attributes,
+    // which we do in the filter below.
+    val oneSideInputAttributeSet = AttributeSet(oneSideInputAttributes)
+
+    // oneSideInputAttributes could be [left_value, left_time], and we just
+    // want the attribute _in_ the time-interval condition.
+    val oneSideStateWatermarkAttributes = attributesInCondition.filter { a =>
+      oneSideInputAttributeSet.contains(a)
```

Review Comment:
   Can you give an example?

   > Is this assured to be left_time mentioned in the comment?

   What about this part?
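To illustrate what the `collect`-then-filter logic in the hunk above is doing, here is a minimal, self-contained sketch. The types `Expr`, `Attr`, `GreaterThan`, and `AddInterval` below are hypothetical stand-ins for Catalyst's `Expression`, `AttributeReference`, and interval arithmetic, and a plain Scala `Set` stands in for `AttributeSet`; none of this is Spark's actual API.

```scala
object WatermarkAttrDemo {
  // Hypothetical stand-ins for Catalyst expression nodes -- not Spark's API.
  sealed trait Expr
  case class Attr(name: String, exprId: Long) extends Expr
  case class GreaterThan(left: Expr, right: Expr) extends Expr
  case class AddInterval(e: Expr, minutes: Int) extends Expr

  // Analogue of `condition.get.collect { case a: AttributeReference => a }`:
  // walk the expression tree and collect every attribute reference in it.
  def collectAttrs(e: Expr): Seq[Attr] = e match {
    case a: Attr               => Seq(a)
    case GreaterThan(l, r)     => collectAttrs(l) ++ collectAttrs(r)
    case AddInterval(inner, _) => collectAttrs(inner)
  }

  val leftTime  = Attr("left_time", 1L)
  val rightTime = Attr("right_time", 2L)

  // left_time > right_time + INTERVAL 30 MINUTES
  val condition: Expr = GreaterThan(leftTime, AddInterval(rightTime, 30))

  // One side's input attributes, e.g. [left_value, left_time].
  val oneSideInputAttributeSet: Set[Attr] = Set(Attr("left_value", 3L), leftTime)

  // Keep only the condition's attributes that belong to this side:
  // of [left_time, right_time], only left_time survives the filter.
  val oneSideStateWatermarkAttributes: Seq[Attr] =
    collectAttrs(condition).filter(oneSideInputAttributeSet.contains)

  def main(args: Array[String]): Unit =
    println(oneSideStateWatermarkAttributes.map(_.name)) // List(left_time)
}
```

Under these assumptions the filter leaves exactly one attribute per side, which is what the `size == 1` check further down relies on.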
########## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinHelper.scala: ##########

```diff
@@ -219,10 +222,41 @@ object StreamingSymmetricHashJoinHelper extends Logging {
       attributesWithEventWatermark = AttributeSet(otherSideInputAttributes),
       condition,
       eventTimeWatermarkForEviction)
-    val inputAttributeWithWatermark = oneSideInputAttributes.find(_.metadata.contains(delayKey))
-    val expr = watermarkExpression(inputAttributeWithWatermark, stateValueWatermark)
-    expr.map(JoinStateValueWatermarkPredicate.apply _)
+    // If the condition itself is empty (for example, left_time < left_time + INTERVAL ...),
+    // then we will not have generated a stateValueWatermark.
+    if (stateValueWatermark.isEmpty) {
+      None
+    } else {
+      // For example, if the condition is of the form:
+      //   left_time > right_time + INTERVAL 30 MINUTES
+      // Then this extracts left_time and right_time.
```

Review Comment:
   Is `condition` here only the time-interval part of the join condition? (e.g. consider `A.id = B.id AND A.ts > B.ts + INTERVAL 30 MINUTES`.)

########## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinHelper.scala: ##########

```diff
@@ -219,10 +222,35 @@ object StreamingSymmetricHashJoinHelper extends Logging {
       attributesWithEventWatermark = AttributeSet(otherSideInputAttributes),
       condition,
       eventTimeWatermarkForEviction)
-    val inputAttributeWithWatermark = oneSideInputAttributes.find(_.metadata.contains(delayKey))
```

Review Comment:
   > filtering for the

   Did you mean filtering out?

   > Effectively, this line is equivalent to oneSideStateWatermarkAttributes.head.

   `This line`: Is that the line you removed?
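The reviewer's question about whether `condition` is only the time-interval part matters for the `size == 1` check below. A toy sketch (again with hypothetical stand-ins for Catalyst's expression classes, not Spark's actual API) shows that if the *full* join condition `A.id = B.id AND A.ts > B.ts + INTERVAL 30 MINUTES` were passed in, collecting attribute references and filtering to one side would yield two attributes, not one:

```scala
object FullConditionDemo {
  // Hypothetical stand-ins for Catalyst expression nodes -- not Spark's API.
  sealed trait Expr
  case class Attr(name: String) extends Expr
  case class EqualTo(left: Expr, right: Expr) extends Expr
  case class GreaterThan(left: Expr, right: Expr) extends Expr
  case class And(left: Expr, right: Expr) extends Expr
  case class AddInterval(e: Expr, minutes: Int) extends Expr

  // Collect every attribute reference anywhere in the tree.
  def collectAttrs(e: Expr): Seq[Attr] = e match {
    case a: Attr               => Seq(a)
    case EqualTo(l, r)         => collectAttrs(l) ++ collectAttrs(r)
    case GreaterThan(l, r)     => collectAttrs(l) ++ collectAttrs(r)
    case And(l, r)             => collectAttrs(l) ++ collectAttrs(r)
    case AddInterval(inner, _) => collectAttrs(inner)
  }

  // The FULL join condition: A.id = B.id AND A.ts > B.ts + INTERVAL 30 MINUTES
  val fullCondition: Expr = And(
    EqualTo(Attr("A.id"), Attr("B.id")),
    GreaterThan(Attr("A.ts"), AddInterval(Attr("B.ts"), 30)))

  // The left side's input attributes.
  val leftSide: Set[Attr] = Set(Attr("A.id"), Attr("A.ts"), Attr("A.value"))

  // Filtering the full condition's attributes down to the left side yields
  // TWO attributes (A.id and A.ts), so a `size == 1` check would only hold
  // if `condition` is restricted to the time-interval part.
  val leftAttrsInFullCondition: Seq[Attr] =
    collectAttrs(fullCondition).filter(leftSide.contains)

  def main(args: Array[String]): Unit =
    println(leftAttrsInFullCondition.map(_.name)) // List(A.id, A.ts)
}
```

This is the scenario behind the later question of asserting that exactly one attribute per side survives the filter.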
########## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinHelper.scala: ##########

```diff
@@ -219,10 +222,41 @@ object StreamingSymmetricHashJoinHelper extends Logging {
       attributesWithEventWatermark = AttributeSet(otherSideInputAttributes),
       condition,
       eventTimeWatermarkForEviction)
-    val inputAttributeWithWatermark = oneSideInputAttributes.find(_.metadata.contains(delayKey))
-    val expr = watermarkExpression(inputAttributeWithWatermark, stateValueWatermark)
-    expr.map(JoinStateValueWatermarkPredicate.apply _)
+    // If the condition itself is empty (for example, left_time < left_time + INTERVAL ...),
+    // then we will not have generated a stateValueWatermark.
+    if (stateValueWatermark.isEmpty) {
+      None
+    } else {
+      // For example, if the condition is of the form:
+      //   left_time > right_time + INTERVAL 30 MINUTES
+      // Then this extracts left_time and right_time.
```

Review Comment:
   I.e. can there be an assert here that `attributesInCondition` is exactly two timestamps, one on each side?

########## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinHelper.scala: ##########

```diff
@@ -219,10 +222,41 @@ object StreamingSymmetricHashJoinHelper extends Logging {
       attributesWithEventWatermark = AttributeSet(otherSideInputAttributes),
       condition,
       eventTimeWatermarkForEviction)
-    val inputAttributeWithWatermark = oneSideInputAttributes.find(_.metadata.contains(delayKey))
-    val expr = watermarkExpression(inputAttributeWithWatermark, stateValueWatermark)
-    expr.map(JoinStateValueWatermarkPredicate.apply _)
+    // If the condition itself is empty (for example, left_time < left_time + INTERVAL ...),
+    // then we will not have generated a stateValueWatermark.
+    if (stateValueWatermark.isEmpty) {
+      None
+    } else {
+      // For example, if the condition is of the form:
+      //   left_time > right_time + INTERVAL 30 MINUTES
+      // Then this extracts left_time and right_time.
+      val attributesInCondition = AttributeSet(
+        condition.get.collect { case a: AttributeReference => a }
+      )
+
+      // Construct an AttributeSet so that we can perform equality between attributes,
+      // which we do in the filter below.
+      val oneSideInputAttributeSet = AttributeSet(oneSideInputAttributes)
+
+      // oneSideInputAttributes could be [left_value, left_time], and we just
+      // want the attribute _in_ the time-interval condition.
+      val oneSideStateWatermarkAttributes = attributesInCondition.filter { a =>
+        oneSideInputAttributeSet.contains(a)
+      }
+
+      // There should be a single attribute per side in the time-interval condition, so,
+      // filtering for oneSideInputAttributes as done above should lead us with 1 attribute.
+      if (oneSideStateWatermarkAttributes.size == 1) {
+        val expr =
+          watermarkExpression(Some(oneSideStateWatermarkAttributes.head), stateValueWatermark)
+        expr.map(JoinStateValueWatermarkPredicate.apply _)
+      } else {
+        // This should never happen, since the grammar will ensure that we have one attribute
```

Review Comment:
   > // This should never happen,

   Why not assert here?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org