[ 
https://issues.apache.org/jira/browse/SPARK-25538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16633254#comment-16633254
 ] 

Steven Rand commented on SPARK-25538:
-------------------------------------

[~kiszk] I've uploaded a tarball containing parquet files that reproduce the 
issue but don't contain any of the values in the original dataset. 
Specifically, some columns have been dropped, all strings have been changed to 
"test_string", all values in col_50 have been changed to 0.0043, and the values 
in col_14 have all been mapped from their original values to values between 
0.001 and 0.0044.
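
For anyone who needs to scrub their own data in a similar way, here is a rough 
sketch of that kind of transformation. It is not the exact set of commands used 
to build the attached tarball: the input and output paths and the dropped 
column names are placeholders, and the remapping of col_14 is left out because 
the mapping itself isn't described here. Note that overwriting a string column 
with a literal also replaces any nulls in it.
{code:java}
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.StringType

// Placeholder path and column names; substitute your own.
val original = spark.read.parquet("/path/to/original")
val trimmed = original.drop("some_sensitive_col", "another_sensitive_col")

// Overwrite every string column with the constant "test_string".
val stringsMasked = trimmed.schema.fields
  .filter(_.dataType == StringType)
  .foldLeft(trimmed)((acc, field) => acc.withColumn(field.name, lit("test_string")))

// Overwrite col_50 with a constant, cast back to the column's original type.
val masked = stringsMasked.withColumn(
  "col_50", lit(0.0043).cast(trimmed.schema("col_50").dataType))

masked.write.parquet("/path/to/repro")
{code}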

This new DataFrame still reproduces issues similar to those in the description:
{code:java}
scala> df.distinct.count
res3: Long = 64

scala> df.sort("col_0").distinct.count
res4: Long = 73

scala> df.withColumnRenamed("col_0", "new").distinct.count
res5: Long = 63
{code}
I get those inconsistent/wrong results on {{2.4.0-rc2}}, and also when I check 
out commit {{a7c19d9c21d59fd0109a7078c80b33d3da03fafd}}, which is SPARK-23713. 
If I check out the commit immediately before it, 
{{fe2b7a4568d65a62da6e6eb00fff05f248b4332c}}, then all three commands return 63.
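
Since neither sorting nor renaming a column changes the set of distinct rows, 
the three counts should always agree. A minimal sketch of checking that in one 
spark-shell session when testing a given commit, assuming {{df}} has already 
been loaded from the attached parquet files:
{code:java}
// Assumes `df` is the DataFrame read from the attached parquet files.
val counts = Seq(
  "distinct"          -> df.distinct.count,
  "sort + distinct"   -> df.sort("col_0").distinct.count,
  "rename + distinct" -> df.withColumnRenamed("col_0", "new").distinct.count
)

counts.foreach { case (label, n) => println(s"$label: $n") }

// On a build without the bug, all three values should be identical.
assert(counts.map(_._2).distinct.size == 1, s"inconsistent distinct counts: $counts")
{code}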

cc [~cloud_fan] – IMO this should block the 2.4.0 release.

> incorrect row counts after distinct()
> -------------------------------------
>
>                 Key: SPARK-25538
>                 URL: https://issues.apache.org/jira/browse/SPARK-25538
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0
>         Environment: Reproduced on a CentOS 7 VM and from source in IntelliJ 
> on OS X.
>            Reporter: Steven Rand
>            Priority: Major
>              Labels: correctness
>         Attachments: SPARK-25538-repro.tgz
>
>
> It appears that {{df.distinct.count}} can return incorrect values after 
> SPARK-23713. It's possible that other operations are affected as well; 
> {{distinct}} just happens to be the one that we noticed. I believe that this 
> issue was introduced by SPARK-23713 because I can't reproduce it before that 
> commit, and I've been able to reproduce it after that commit as well as with 
> {{tags/v2.4.0-rc1}}. 
> Below are example spark-shell sessions to illustrate the problem. 
> Unfortunately the data used in these examples can't be uploaded to this Jira 
> ticket. I'll try to create test data which also reproduces the issue, and 
> will upload that if I'm able to do so.
> Example from Spark 2.3.1, which behaves correctly:
> {code}
> scala> val df = spark.read.parquet("hdfs:///data")
> df: org.apache.spark.sql.DataFrame = [<redacted>]
> scala> df.count
> res0: Long = 123
> scala> df.distinct.count
> res1: Long = 115
> {code}
> Example from Spark 2.4.0-rc1, which returns different output:
> {code}
> scala> val df = spark.read.parquet("hdfs:///data")
> df: org.apache.spark.sql.DataFrame = [<redacted>]
> scala> df.count
> res0: Long = 123
> scala> df.distinct.count
> res1: Long = 116
> scala> df.sort("col_0").distinct.count
> res2: Long = 123
> scala> df.withColumnRenamed("col_0", "newName").distinct.count
> res3: Long = 115
> {code}



