HeartSaVioR commented on code in PR #40561:
URL: https://github.com/apache/spark/pull/40561#discussion_r1151441374

##########
sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala:
##########
@@ -3038,6 +3038,118 @@ class Dataset[T] private[sql](
     dropDuplicates(colNames)
   }
 
+  /**
+   * Returns a new Dataset with duplicates rows removed, as long as event times of duplicated rows
+   * are within delay threshold of watermark.
+   *
+   * This only works with streaming [[Dataset]], and watermark for the input [[Dataset]] must be

Review Comment:
   I would like to explicitly block users from simply switching this operator between batch and streaming and assuming the behavior is similar.

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org