rangadi commented on code in PR #44323: URL: https://github.com/apache/spark/pull/44323#discussion_r1581366231
########## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinHelper.scala: ##########

```diff
@@ -219,10 +222,35 @@ object StreamingSymmetricHashJoinHelper extends Logging {
       attributesWithEventWatermark = AttributeSet(otherSideInputAttributes),
       condition,
       eventTimeWatermarkForEviction)
-    val inputAttributeWithWatermark = oneSideInputAttributes.find(_.metadata.contains(delayKey))
-    val expr = watermarkExpression(inputAttributeWithWatermark, stateValueWatermark)
-    expr.map(JoinStateValueWatermarkPredicate.apply _)
+    // For example, if the condition is of the form:
+    //   left_time > right_time + INTERVAL 30 MINUTES
+    // Then this extracts left_time and right_time.
+    val attributesInCondition = AttributeSet(
+      condition.get.collect { case a: AttributeReference => a }
+    )
+
+    // Construct an AttributeSet so that we can perform equality between attributes,
+    // which we do in the filter below.
+    val oneSideInputAttributeSet = AttributeSet(oneSideInputAttributes)
+
+    // oneSideInputAttributes could be [left_value, left_time], and we just
+    // want the attribute _in_ the time-interval condition.
+    val oneSideStateWatermarkAttributes = attributesInCondition.filter { a =>
+      oneSideInputAttributeSet.contains(a)
```

Review Comment:
   Can you give an example?

   > Is this assured to be left_time mentioned in the comment?

   What about this part?
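To illustrate what the `collect`-then-filter logic in the hunk above is doing, here is a minimal, self-contained sketch. The types `Expr`, `Attr`, `GreaterThan`, and `AddInterval` below are hypothetical stand-ins for Catalyst's `Expression`, `AttributeReference`, and interval arithmetic, and a plain Scala `Set` stands in for `AttributeSet`; none of this is Spark's actual API.

```scala
object WatermarkAttrDemo {
  // Hypothetical stand-ins for Catalyst expression nodes -- not Spark's API.
  sealed trait Expr
  case class Attr(name: String, exprId: Long) extends Expr
  case class GreaterThan(left: Expr, right: Expr) extends Expr
  case class AddInterval(e: Expr, minutes: Int) extends Expr

  // Analogue of `condition.get.collect { case a: AttributeReference => a }`:
  // walk the expression tree and collect every attribute reference in it.
  def collectAttrs(e: Expr): Seq[Attr] = e match {
    case a: Attr               => Seq(a)
    case GreaterThan(l, r)     => collectAttrs(l) ++ collectAttrs(r)
    case AddInterval(inner, _) => collectAttrs(inner)
  }

  val leftTime  = Attr("left_time", 1L)
  val rightTime = Attr("right_time", 2L)

  // left_time > right_time + INTERVAL 30 MINUTES
  val condition: Expr = GreaterThan(leftTime, AddInterval(rightTime, 30))

  // One side's input attributes, e.g. [left_value, left_time].
  val oneSideInputAttributeSet: Set[Attr] = Set(Attr("left_value", 3L), leftTime)

  // Keep only the condition's attributes that belong to this side:
  // of [left_time, right_time], only left_time survives the filter.
  val oneSideStateWatermarkAttributes: Seq[Attr] =
    collectAttrs(condition).filter(oneSideInputAttributeSet.contains)

  def main(args: Array[String]): Unit =
    println(oneSideStateWatermarkAttributes.map(_.name)) // List(left_time)
}
```

Under these assumptions the filter leaves exactly one attribute per side, which is what the `size == 1` check further down relies on.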
########## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinHelper.scala: ##########

```diff
@@ -219,10 +222,41 @@ object StreamingSymmetricHashJoinHelper extends Logging {
       attributesWithEventWatermark = AttributeSet(otherSideInputAttributes),
       condition,
       eventTimeWatermarkForEviction)
-    val inputAttributeWithWatermark = oneSideInputAttributes.find(_.metadata.contains(delayKey))
-    val expr = watermarkExpression(inputAttributeWithWatermark, stateValueWatermark)
-    expr.map(JoinStateValueWatermarkPredicate.apply _)
+    // If the condition itself is empty (for example, left_time < left_time + INTERVAL ...),
+    // then we will not have generated a stateValueWatermark.
+    if (stateValueWatermark.isEmpty) {
+      None
+    } else {
+      // For example, if the condition is of the form:
+      //   left_time > right_time + INTERVAL 30 MINUTES
+      // Then this extracts left_time and right_time.
```

Review Comment:
   Is `condition` here only the time-interval part of the join condition? (e.g. consider `A.id = B.id AND A.ts > B.ts + INTERVAL 30 MINUTES`.)

########## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinHelper.scala: ##########

```diff
@@ -219,10 +222,35 @@ object StreamingSymmetricHashJoinHelper extends Logging {
       attributesWithEventWatermark = AttributeSet(otherSideInputAttributes),
       condition,
       eventTimeWatermarkForEviction)
-    val inputAttributeWithWatermark = oneSideInputAttributes.find(_.metadata.contains(delayKey))
```

Review Comment:
   > filtering for the

   Did you mean filtering out?

   > Effectively, this line is equivalent to oneSideStateWatermarkAttributes.head.

   `This line`: Is that the line you removed?
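The reviewer's question about whether `condition` is only the time-interval part matters for the `size == 1` check below. A toy sketch (again with hypothetical stand-ins for Catalyst's expression classes, not Spark's actual API) shows that if the *full* join condition `A.id = B.id AND A.ts > B.ts + INTERVAL 30 MINUTES` were passed in, collecting attribute references and filtering to one side would yield two attributes, not one:

```scala
object FullConditionDemo {
  // Hypothetical stand-ins for Catalyst expression nodes -- not Spark's API.
  sealed trait Expr
  case class Attr(name: String) extends Expr
  case class EqualTo(left: Expr, right: Expr) extends Expr
  case class GreaterThan(left: Expr, right: Expr) extends Expr
  case class And(left: Expr, right: Expr) extends Expr
  case class AddInterval(e: Expr, minutes: Int) extends Expr

  // Collect every attribute reference anywhere in the tree.
  def collectAttrs(e: Expr): Seq[Attr] = e match {
    case a: Attr               => Seq(a)
    case EqualTo(l, r)         => collectAttrs(l) ++ collectAttrs(r)
    case GreaterThan(l, r)     => collectAttrs(l) ++ collectAttrs(r)
    case And(l, r)             => collectAttrs(l) ++ collectAttrs(r)
    case AddInterval(inner, _) => collectAttrs(inner)
  }

  // The FULL join condition: A.id = B.id AND A.ts > B.ts + INTERVAL 30 MINUTES
  val fullCondition: Expr = And(
    EqualTo(Attr("A.id"), Attr("B.id")),
    GreaterThan(Attr("A.ts"), AddInterval(Attr("B.ts"), 30)))

  // The left side's input attributes.
  val leftSide: Set[Attr] = Set(Attr("A.id"), Attr("A.ts"), Attr("A.value"))

  // Filtering the full condition's attributes down to the left side yields
  // TWO attributes (A.id and A.ts), so a `size == 1` check would only hold
  // if `condition` is restricted to the time-interval part.
  val leftAttrsInFullCondition: Seq[Attr] =
    collectAttrs(fullCondition).filter(leftSide.contains)

  def main(args: Array[String]): Unit =
    println(leftAttrsInFullCondition.map(_.name)) // List(A.id, A.ts)
}
```

This is the scenario behind the later question of asserting that exactly one attribute per side survives the filter.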
########## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinHelper.scala: ##########

```diff
@@ -219,10 +222,41 @@ object StreamingSymmetricHashJoinHelper extends Logging {
       attributesWithEventWatermark = AttributeSet(otherSideInputAttributes),
       condition,
       eventTimeWatermarkForEviction)
-    val inputAttributeWithWatermark = oneSideInputAttributes.find(_.metadata.contains(delayKey))
-    val expr = watermarkExpression(inputAttributeWithWatermark, stateValueWatermark)
-    expr.map(JoinStateValueWatermarkPredicate.apply _)
+    // If the condition itself is empty (for example, left_time < left_time + INTERVAL ...),
+    // then we will not have generated a stateValueWatermark.
+    if (stateValueWatermark.isEmpty) {
+      None
+    } else {
+      // For example, if the condition is of the form:
+      //   left_time > right_time + INTERVAL 30 MINUTES
+      // Then this extracts left_time and right_time.
```

Review Comment:
   I.e. can there be an assert here that `attributesInCondition` is exactly two timestamps, one on each side?

########## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinHelper.scala: ##########

```diff
@@ -219,10 +222,41 @@ object StreamingSymmetricHashJoinHelper extends Logging {
       attributesWithEventWatermark = AttributeSet(otherSideInputAttributes),
       condition,
       eventTimeWatermarkForEviction)
-    val inputAttributeWithWatermark = oneSideInputAttributes.find(_.metadata.contains(delayKey))
-    val expr = watermarkExpression(inputAttributeWithWatermark, stateValueWatermark)
-    expr.map(JoinStateValueWatermarkPredicate.apply _)
+    // If the condition itself is empty (for example, left_time < left_time + INTERVAL ...),
+    // then we will not have generated a stateValueWatermark.
+    if (stateValueWatermark.isEmpty) {
+      None
+    } else {
+      // For example, if the condition is of the form:
+      //   left_time > right_time + INTERVAL 30 MINUTES
+      // Then this extracts left_time and right_time.
+      val attributesInCondition = AttributeSet(
+        condition.get.collect { case a: AttributeReference => a }
+      )
+
+      // Construct an AttributeSet so that we can perform equality between attributes,
+      // which we do in the filter below.
+      val oneSideInputAttributeSet = AttributeSet(oneSideInputAttributes)
+
+      // oneSideInputAttributes could be [left_value, left_time], and we just
+      // want the attribute _in_ the time-interval condition.
+      val oneSideStateWatermarkAttributes = attributesInCondition.filter { a =>
+        oneSideInputAttributeSet.contains(a)
+      }
+
+      // There should be a single attribute per side in the time-interval condition, so,
+      // filtering for oneSideInputAttributes as done above should lead us with 1 attribute.
+      if (oneSideStateWatermarkAttributes.size == 1) {
+        val expr =
+          watermarkExpression(Some(oneSideStateWatermarkAttributes.head), stateValueWatermark)
+        expr.map(JoinStateValueWatermarkPredicate.apply _)
+      } else {
+        // This should never happen, since the grammar will ensure that we have one attribute
```

Review Comment:
   > // This should never happen,

   Why not assert here?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org