[ 
https://issues.apache.org/jira/browse/SPARK-51710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-51710:
-----------------------------------
    Labels: pull-request-available  (was: )

> Using DataFrame.dropDuplicates with an empty array as argument behaves 
> unexpectedly
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-51710
>                 URL: https://issues.apache.org/jira/browse/SPARK-51710
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.5.5
>            Reporter: David Kunzmann
>            Priority: Major
>              Labels: pull-request-available
>
> When PySpark DataFrame.dropDuplicates is called with an empty list as the 
> subset argument, the resulting DataFrame contains a single row (the first 
> row). This differs from calling DataFrame.dropDuplicates without any 
> parameters or with None as the subset argument.
>  
> {code:python}
> from pyspark.sql import SparkSession
>  
> spark = SparkSession.builder.getOrCreate()
> data = [
>     (1, "Alice"),
>     (2, "Bob"),
>     (3, "Alice"),
>     (3, "Alice"),
>     (2, "Bob")
> ]
> df = spark.createDataFrame(data, ["id", "name"])
> df_dedup = df.dropDuplicates([])
> df_dedup.show()
> {code}
> The above snippet will show the following DataFrame:
> {code}
> +---+-----+
> | id| name|
> +---+-----+
> |  1|Alice|
> +---+-----+
> {code}
> I would expect the behavior to be the same as df.dropDuplicates(), which 
> returns:
> {code}
> +---+-----+
> | id| name|
> +---+-----+
> |  1|Alice|
> |  2|  Bob|
> |  3|Alice|
> +---+-----+
> {code}
>  
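
One possible caller-side workaround is to normalize the subset before it
reaches dropDuplicates. The sketch below is not part of the report:
drop_duplicates_safe is a hypothetical helper name, and it assumes that an
empty subset should mean "deduplicate on all columns", which is the behavior
the reporter expects.

```python
from typing import Any, Optional, Sequence


def drop_duplicates_safe(df: Any, subset: Optional[Sequence[str]] = None) -> Any:
    """Hypothetical helper: treat an empty subset like no subset at all.

    `df` is expected to be a pyspark.sql.DataFrame; it is typed as Any so
    that this sketch does not need pyspark at import time.
    """
    if not subset:
        # None or an empty list/tuple: deduplicate on all columns,
        # exactly as a bare df.dropDuplicates() call does.
        return df.dropDuplicates()
    # Non-empty subset: pass the column names through as a list.
    return df.dropDuplicates(list(subset))
```

With the DataFrame from the report, drop_duplicates_safe(df, []) takes the
same code path as df.dropDuplicates(), so the empty-list case no longer
reaches dropDuplicates at all.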



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
