[ https://issues.apache.org/jira/browse/SPARK-25420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16613217#comment-16613217 ]

Marco Gaido commented on SPARK-25420:
-------------------------------------

I think the reason here is that we don't enforce any ordering on the 
incoming Dataset: dropDuplicates keeps the first row it sees among those 
sharing the same aggregation columns, so which row survives in each group 
depends on the (non-deterministic) order in which rows arrive. This isn't 
really a bug in itself; the output of dropDuplicates is simply 
non-deterministic unless a specific ordering on the input is enforced.
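The point can be illustrated without Spark. The sketch below (plain Java, with made-up column names mirroring the report) mimics "keep the first row seen per key": both runs keep exactly one row per key, but which row survives depends on arrival order, so a later filter on "status" returns different counts. An explicit tie-breaking rule inside each group (here: smallest "status") removes the order dependence — this is an assumption for illustration, not Spark's actual implementation.

```java
import java.util.*;
import java.util.stream.Collectors;

public class DedupOrderDemo {

    // Mimics dropDuplicates semantics: keep the first row encountered per key.
    public static List<Map<String, Object>> firstPerKey(
            List<Map<String, Object>> rows, List<String> keyCols) {
        Map<List<Object>, Map<String, Object>> seen = new LinkedHashMap<>();
        for (Map<String, Object> row : rows) {
            List<Object> key =
                keyCols.stream().map(row::get).collect(Collectors.toList());
            seen.putIfAbsent(key, row);
        }
        return new ArrayList<>(seen.values());
    }

    // Deterministic variant: break ties inside each group with an explicit
    // rule (smallest "status"), so the input order no longer matters.
    public static List<Map<String, Object>> minStatusPerKey(
            List<Map<String, Object>> rows, List<String> keyCols) {
        Map<List<Object>, Map<String, Object>> best = new HashMap<>();
        for (Map<String, Object> row : rows) {
            List<Object> key =
                keyCols.stream().map(row::get).collect(Collectors.toList());
            best.merge(key, row, (a, b) ->
                (int) a.get("status") <= (int) b.get("status") ? a : b);
        }
        return new ArrayList<>(best.values());
    }

    public static void main(String[] args) {
        Map<String, Object> r0 = Map.of("DATE", "07-08", "VEL", 10, "status", 0);
        Map<String, Object> r1 = Map.of("DATE", "07-08", "VEL", 10, "status", 1);
        List<String> keys = List.of("DATE", "VEL");

        // Same two rows, two different arrival orders.
        for (List<Map<String, Object>> rows
                : List.of(List.of(r0, r1), List.of(r1, r0))) {
            long kept = firstPerKey(rows, keys).stream()
                .filter(r -> (int) r.get("status") == 0).count();
            long keptDet = minStatusPerKey(rows, keys).stream()
                .filter(r -> (int) r.get("status") == 0).count();
            System.out.println("first-per-key: " + kept
                + "  min-status-per-key: " + keptDet);
        }
        // first-per-key flips between 1 and 0 depending on arrival order;
        // min-status-per-key stays 1 in both runs.
    }
}
```

In Spark itself, the analogous deterministic fix is to rank rows inside each group with a window function over an explicit ordering and keep rank 1, instead of relying on which row dropDuplicates happens to see first.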

> Dataset.count() is different every time.
> ----------------------------------------
>
>                 Key: SPARK-25420
>                 URL: https://issues.apache.org/jira/browse/SPARK-25420
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.3.0
>         Environment: spark2.3
> standalone
>            Reporter: huanghuai
>            Priority: Major
>              Labels: SQL
>
> Dataset<Row> dataset = sparkSession.read().format("csv")
>     .option("sep", ",")
>     .option("inferSchema", "true")
>     .option("escape", Constants.DEFAULT_CSV_ESCAPE)
>     .option("header", "true")
>     .option("encoding", "UTF-8")
>     .load("hdfs://192.168.1.26:9000/data/caopan/07-08_WithHead30M.csv");
> System.out.println("source count=" + dataset.count());
> Dataset<Row> dropDuplicates = dataset.dropDuplicates(
>     new String[]{"DATE", "TIME", "VEL", "COMPANY"});
> System.out.println("dropDuplicates count1=" + dropDuplicates.count());
> System.out.println("dropDuplicates count2=" + dropDuplicates.count());
> Dataset<Row> filter = dropDuplicates.filter(
>     "jd > 120.85 and wd > 30.666666 and (status = 0 or status = 1)");
> System.out.println("filter count1=" + filter.count());
> System.out.println("filter count2=" + filter.count());
> System.out.println("filter count3=" + filter.count());
> System.out.println("filter count4=" + filter.count());
> System.out.println("filter count5=" + filter.count());
>  
>  
> -------------------- The above is code --------------------
>  
>  
> console output:
> source count=459275
> dropDuplicates count1=453987
> dropDuplicates count2=453987
> filter count1=445798
> filter count2=445797
> filter count3=445797
> filter count4=445798
> filter count5=445799
>  
> question:
>  
> Why is filter.count() different every time?
> If I remove dropDuplicates(), everything works fine!
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
