HeartSaVioR commented on code in PR #40561:
URL: https://github.com/apache/spark/pull/40561#discussion_r1151462460

##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala:
##########
@@ -679,6 +679,8 @@ object RemoveNoopUnion extends Rule[LogicalPlan] {
       d.withNewChildren(Seq(simplifyUnion(u)))
     case d @ Deduplicate(_, u: Union) =>
       d.withNewChildren(Seq(simplifyUnion(u)))
+    case d @ DeduplicateWithinWatermark(_, u: Union) =>

Review Comment:
   No, what I meant is this: if we assume a perfect watermark on the streaming side, a streaming query is designed to produce the same output as the batch one. dropDuplicates() is no exception. Arguably, dropDuplicatesWithinWatermark is an exception, because it requires reasoning not only about the "lateness" of the data but also about the maximum time duration between duplicated events. (Technically speaking, the two are different.)

   Would it be safe to just change read to readStream / write to writeStream and vice versa? Mostly yes for the existing APIs, but maybe not for this API.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org