HeartSaVioR commented on code in PR #40561:
URL: https://github.com/apache/spark/pull/40561#discussion_r1159396995


##########
python/pyspark/sql/dataframe.py:
##########
@@ -3928,6 +3928,71 @@ def dropDuplicates(self, subset: Optional[List[str]] = None) -> "DataFrame":
             jdf = self._jdf.dropDuplicates(self._jseq(subset))
         return DataFrame(jdf, self.sparkSession)
 
+    def dropDuplicatesWithinWatermark(self, subset: Optional[List[str]] = None) -> "DataFrame":

Review Comment:
   Grouping this with the Scala API discussion above: it is important to decide which API we want this one to be consistent with. I designed this API as a sibling of the existing dropDuplicates API, and I want it to provide the same UX.
   
   Say, for a batch query, someone can append the suffix to the name in an existing query (dropDuplicates -> dropDuplicatesWithinWatermark) and expect the query to run without any issue. That expectation is honored now, regardless of the parameter type they use.
   
   For a streaming query, they might already have a query using dropDuplicates but have been suffering from the issue I mentioned in the `Why are the changes needed?` section. In most cases, we expect them to simply append the suffix to the name in the existing query, discard the checkpoint, and then enjoy having the problem fixed.
   
   I hope this is enough rationale for making the new API consistent with dropDuplicates.
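   A minimal sketch of the migration described above (assuming Spark 3.5+, where `dropDuplicatesWithinWatermark` is available; the data, column names, and watermark settings are illustrative, not from the PR):
   
   ```python
   from pyspark.sql import SparkSession
   
   spark = (
       SparkSession.builder
       .master("local[1]")
       .appName("dedup-within-watermark-sketch")
       .getOrCreate()
   )
   
   # Batch query: the new API accepts the same parameter styles as
   # dropDuplicates, so renaming the call is the only change needed.
   df = spark.createDataFrame([(1, "a"), (1, "a"), (2, "b")], ["id", "value"])
   before_count = df.dropDuplicates(["id"]).count()                # existing API
   after_count = df.dropDuplicatesWithinWatermark(["id"]).count()  # new sibling API
   
   # Streaming query (sketch only): define a watermark, then swap the
   # call name; the checkpoint must be discarded when migrating.
   # events = spark.readStream.format("...").load()
   # deduped = (
   #     events.withWatermark("eventTime", "1 hour")
   #           .dropDuplicatesWithinWatermark(["id"])
   # )
   ```
   
   In batch execution the watermark is a no-op, so both calls above deduplicate identically; the behavioral difference only matters for streaming state eviction.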



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
