zsxwing commented on code in PR #40561: URL: https://github.com/apache/spark/pull/40561#discussion_r1160479372
########## python/pyspark/sql/dataframe.py:
##########
@@ -3928,6 +3928,71 @@ def dropDuplicates(self, subset: Optional[List[str]] = None) -> "DataFrame":
         jdf = self._jdf.dropDuplicates(self._jseq(subset))
         return DataFrame(jdf, self.sparkSession)

+    def dropDuplicatesWithinWatermark(self, subset: Optional[List[str]] = None) -> "DataFrame":
+        """Return a new :class:`DataFrame` with duplicate rows removed,
+        optionally only considering certain columns, within watermark.
+
+        For a static batch :class:`DataFrame`, it just drops duplicate rows. For a streaming
+        :class:`DataFrame`, this will keep all data across triggers as intermediate state to drop
+        duplicated rows. The state will be kept to guarantee the semantic, "Events are deduplicated
+        as long as the time distance of earliest and latest events are smaller than the delay
+        threshold of watermark." The watermark for the input :class:`DataFrame` must be set via
+        :func:`withWatermark`. Users are encouraged to set the delay threshold of watermark longer

Review Comment:
   It sounds weird to me that a method is called dropDuplicates**WithinWatermark** but I don't need to set the watermark.

   > For batch, there are a bunch of tools to perform deduplication: distinct / dropDuplicates / dropDuplicatesWithinWatermark. Most batch use cases don't need to reach for dropDuplicatesWithinWatermark.

   I think the common use case for `dropDuplicatesWithinWatermark` in batch is: develop the code in batch mode and switch to streaming later. In this case, catching potential issues in batch mode is better.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
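The semantic quoted in the docstring ("events are deduplicated as long as the time distance of earliest and latest events is smaller than the delay threshold") can be illustrated outside Spark. The following is a minimal pure-Python sketch of that deduplication rule, not Spark's stateful implementation; the function name `dedup_within_delay` and the `(key, timestamp)` event representation are illustrative assumptions.

```python
from datetime import datetime, timedelta

def dedup_within_delay(events, delay):
    # Illustrative model only, NOT Spark's implementation: an event is
    # dropped as a duplicate if its time distance from the first-seen
    # event with the same key is smaller than the delay threshold;
    # otherwise it is kept and starts a new deduplication window.
    first_seen = {}  # key -> event time of the last kept event for that key
    kept = []
    for key, ts in events:
        anchor = first_seen.get(key)
        if anchor is not None and ts - anchor < delay:
            continue  # within the delay window of an earlier event: duplicate
        first_seen[key] = ts  # unseen key, or outside the window: keep it
        kept.append((key, ts))
    return kept
```

In actual streaming code, this is the behavior the reviewer is alluding to when a user chains `withWatermark` before `dropDuplicatesWithinWatermark`, e.g. `df.withWatermark("timestamp", "10 minutes").dropDuplicatesWithinWatermark(["id"])`; the watermark delay plays the role of `delay` above.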
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
---------------------------------------------------------------------
For additional commands, e-mail: reviews-h...@spark.apache.org