HeartSaVioR commented on code in PR #40561:
URL: https://github.com/apache/spark/pull/40561#discussion_r1159396995
##########
python/pyspark/sql/dataframe.py:
##########
@@ -3928,6 +3928,71 @@ def dropDuplicates(self, subset: Optional[List[str]] = None) -> "DataFrame":
         jdf = self._jdf.dropDuplicates(self._jseq(subset))
         return DataFrame(jdf, self.sparkSession)
 
+    def dropDuplicatesWithinWatermark(self, subset: Optional[List[str]] = None) -> "DataFrame":

Review Comment:
   Grouping this with the Scala API question above: it is important to decide which API we want this one to be consistent with.
   
   I designed this API as a sibling of the existing dropDuplicates API, and I want it to provide the same UX. For example, in a batch query, someone can append the suffix to the method name in an existing query (dropDuplicates -> dropDuplicatesWithinWatermark) and expect the query to run without any issue. That now holds regardless of the parameter type they use.
   
   For a streaming query, they might already have a query using dropDuplicates but have been suffering from the issue I mentioned in the section `Why are the changes needed?`. In most cases, we expect them to just append the suffix to the method name in the existing query, discard the checkpoint, and then see the problem fixed.
   
   I hope this is enough rationale for making the new API consistent with dropDuplicates.
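   For illustration, here is a minimal sketch of what the rename looks like in a streaming query. The rate source, the column renames, and the watermark duration are stand-ins for this example only, not part of the PR:
   
   ```python
   from pyspark.sql import SparkSession
   
   spark = SparkSession.builder.appName("dedup-within-watermark").getOrCreate()
   
   # Hypothetical streaming source; "eventTime" and "id" are stand-in columns.
   events = (
       spark.readStream.format("rate").load()
       .withColumnRenamed("timestamp", "eventTime")
       .withColumnRenamed("value", "id")
   )
   
   # Before: dropDuplicates on a streaming query keeps state for each key
   # indefinitely when the subset does not include the event-time column.
   # deduped = events.withWatermark("eventTime", "10 minutes").dropDuplicates(["id"])
   
   # After: append the suffix; the call shape stays identical, and state for
   # a key becomes eligible for eviction once the watermark moves past it.
   deduped = (
       events.withWatermark("eventTime", "10 minutes")
       .dropDuplicatesWithinWatermark(["id"])
   )
   ```
   
   The batch case is the same one-token change: the new method accepts the same parameter shapes as dropDuplicates, so an existing batch call keeps working after the rename.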