Steven Rand created SPARK-25538:
-----------------------------------

             Summary: incorrect row counts after distinct()
                 Key: SPARK-25538
                 URL: https://issues.apache.org/jira/browse/SPARK-25538
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.4.0
         Environment: Reproduced on a CentOS 7 VM and from source in IntelliJ on 
OS X.
            Reporter: Steven Rand


It appears that {{df.distinct.count}} can return incorrect values after 
SPARK-23713. Other operations may be affected as well; {{distinct}} just 
happens to be the one that we noticed. I believe that this issue was 
introduced by SPARK-23713 because I can't reproduce it before that commit, 
and I can reproduce it after that commit as well as with 
{{tags/v2.4.0-rc1}}.

Below are example spark-shell sessions to illustrate the problem. Unfortunately 
the data used in these examples can't be uploaded to this Jira ticket. I'll try 
to create test data which also reproduces the issue, and will upload that if 
I'm able to do so.

Example from Spark 2.3.1, which behaves correctly:

{code}
scala> val df = spark.read.parquet("hdfs:///data")
df: org.apache.spark.sql.DataFrame = [<redacted>]

scala> df.count
res0: Long = 123

scala> df.distinct.count
res1: Long = 115
{code}

Example from Spark 2.4.0-rc1, where {{df.distinct.count}} returns a different 
value and the count changes across logically equivalent queries:

{code}
scala> val df = spark.read.parquet("hdfs:///data")
df: org.apache.spark.sql.DataFrame = [<redacted>]

scala> df.count
res0: Long = 123

scala> df.distinct.count
res1: Long = 116

scala> df.sort("col_0").distinct.count
res2: Long = 123

scala> df.withColumnRenamed("col_0", "newName").distinct.count
res3: Long = 115
{code}
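
For anyone who wants to check their own data, here is a minimal sketch of a cross-check that can be pasted into spark-shell. It assumes {{df}} is already loaded as in the sessions above and that {{col_0}} is one of its columns; the labels and the particular set of equivalent queries are only illustrative. All of these queries deduplicate through an aggregate, so agreement between them does not prove the count is correct, but disagreement suggests the problem described here.

```scala
// Sketch only: compute the distinct row count along several logically
// equivalent paths and report whether they agree.
// Assumes a spark-shell session in which `df` is already defined.
import org.apache.spark.sql.functions.col

// Every column of the DataFrame, used to express distinct() as a groupBy.
val allCols = df.columns.map(col)

val counts = Seq(
  "distinct"           -> df.distinct.count,
  "dropDuplicates"     -> df.dropDuplicates().count,
  "groupBy all cols"   -> df.groupBy(allCols: _*).count().count,
  "sort then distinct" -> df.sort("col_0").distinct.count // "col_0" as in the session above
)

counts.foreach { case (name, n) => println(f"$name%-20s $n") }

// Logically, all four values should be identical.
if (counts.map(_._2).distinct.size > 1)
  println("distinct counts disagree across equivalent queries")
```

Running the same snippet on 2.3.1 and on 2.4.0-rc1 against the same input gives a side-by-side comparison of the two versions.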


