HeartSaVioR commented on code in PR #40561:
URL: https://github.com/apache/spark/pull/40561#discussion_r1159292444


##########
python/pyspark/sql/dataframe.py:
##########
@@ -3928,6 +3928,71 @@ def dropDuplicates(self, subset: Optional[List[str]] = None) -> "DataFrame":
             jdf = self._jdf.dropDuplicates(self._jseq(subset))
         return DataFrame(jdf, self.sparkSession)
 
+    def dropDuplicatesWithinWatermark(self, subset: Optional[List[str]] = None) -> "DataFrame":
+        """Return a new :class:`DataFrame` with duplicate rows removed,
+         optionally only considering certain columns, within watermark.
+
+        For a static batch :class:`DataFrame`, it just drops duplicate rows. For a streaming
+        :class:`DataFrame`, this will keep all data across triggers as intermediate state to drop
+        duplicated rows. The state will be kept to guarantee the semantic, "Events are deduplicated
+        as long as the time distance of earliest and latest events are smaller than the delay
+        threshold of watermark." The watermark for the input :class:`DataFrame` must be set via
+        :func:`withWatermark`. Users are encouraged to set the delay threshold of watermark longer

Review Comment:
   A batch DataFrame does not require a watermark, and we actually remove
withWatermark if the query is a batch one. Requiring a batch query to provide a
watermark would be very odd, because we ignore the delay threshold anyway.
   
   So I think there are two choices: 1) do not support batch queries, since it's
confusing, or 2) tolerate the difference in UX and just do the same as the
existing API for the batch case. Currently it's 2), but the initial proposal
was 1). I'm open to both.
   cc. @rangadi 
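
To make the UX difference in choice 2) concrete, here is a toy pure-Python model, not Spark code. The names `dedupe_batch` and `dedupe_streaming` are hypothetical, and the eviction rule (each row is its own trigger, and a key's state is dropped once the watermark passes its first event time plus the delay) is a simplifying assumption for illustration only, not a description of the actual implementation.

```python
from typing import Any, Dict, List, Optional, Tuple

Row = Dict[str, Any]


def _key(row: Row, subset: Optional[List[str]]) -> Tuple:
    # Use all columns (in sorted order) when no subset is given.
    cols = subset if subset is not None else sorted(row)
    return tuple(row[c] for c in cols)


def dedupe_batch(rows: List[Row], subset: Optional[List[str]] = None) -> List[Row]:
    # Choice 2) for batch input: the watermark plays no role, so this is
    # plain "keep the first row per key" deduplication.
    seen = set()
    out = []
    for row in rows:
        k = _key(row, subset)
        if k not in seen:
            seen.add(k)
            out.append(row)
    return out


def dedupe_streaming(
    rows: List[Row], time_col: str, delay: int, subset: Optional[List[str]] = None
) -> List[Row]:
    # Toy streaming model (assumption): the watermark is the max event time
    # seen in earlier rows minus `delay`; a key's state expires at its first
    # event time plus `delay` and is evicted once the watermark reaches that.
    state: Dict[Tuple, int] = {}  # key -> expiry time
    max_time: Optional[int] = None
    out = []
    for row in rows:
        if max_time is not None:
            watermark = max_time - delay
            state = {k: exp for k, exp in state.items() if exp > watermark}
        k = _key(row, subset)
        if k not in state:
            state[k] = row[time_col] + delay
            out.append(row)
        t = row[time_col]
        max_time = t if max_time is None else max(max_time, t)
    return out


events = [
    {"id": "a", "ts": 0},
    {"id": "a", "ts": 5},
    {"id": "b", "ts": 40},
    {"id": "a", "ts": 41},
]
batch_result = dedupe_batch(events, ["id"])
streaming_result = dedupe_streaming(events, "ts", 10, ["id"])
```

In this toy model the batch call keeps one row per `id`, while the streaming call emits `a` a second time once its state has expired. That is the kind of difference a user would have to tolerate under choice 2).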



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

