[ https://issues.apache.org/jira/browse/SPARK-25420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16755832#comment-16755832 ]

Jeffrey edited comment on SPARK-25420 at 1/30/19 8:15 AM:
----------------------------------------------------------

Thanks [~zhiyin1233]

Let me treat this not as a bug but as a usage problem, then: is there a way to
use dropDuplicates so that it consistently returns at least one record from
every group? In my case, that would be one record from Group A, one from
Group B, and one from Group C.

I still think dropDuplicates is expected to leave one record for each group.
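For reference, here is a minimal sketch of a deterministic alternative I am
considering (untested against the real data): rank the rows inside each group
with a window function over an explicit tiebreaker column and keep only rank 1.
The ROW_ID column below is hypothetical; any column that gives a total ordering
within a group would do.

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.row_number;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;

// Rank rows within each duplicate group by an explicit ordering column,
// then keep only the first row of each group. Unlike dropDuplicates, the
// surviving row is the same on every execution, because the ordering is
// total (assuming ROW_ID is unique within each group).
WindowSpec w = Window
    .partitionBy(col("DATE"), col("TIME"), col("VEL"), col("COMPANY"))
    .orderBy(col("ROW_ID"));  // ROW_ID: hypothetical tiebreaker column

Dataset<Row> deduped = dataset
    .withColumn("rn", row_number().over(w))
    .filter(col("rn").equalTo(1))
    .drop("rn");

This still returns exactly one record per group, but now the same record every
time, so a later filter followed by count should be stable.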

[https://www.mungingdata.com/apache-spark/deduplicating-and-collapsing]

 

Sorry, I am still a rookie with Spark, but I do not see the logic behind this,
nor where the Spark documentation says that dropDuplicates may drop a whole
group of duplicate records without leaving at least one of them. If that is how
it is really supposed to behave, I don't see in what cases this operation would
be useful in practice.
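If I understand the earlier reply correctly, dropDuplicates does keep one row
per group, but which row survives can differ between executions, so a later
filter may see a different survivor and the count moves. Assuming that is the
case, one workaround sketch (again, untested here) is to persist the
deduplicated Dataset so that downstream actions reuse the same materialized
rows:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.storage.StorageLevel;

// Pin down the (arbitrary, but now fixed) surviving rows once, so that the
// subsequent filter/count do not re-run dropDuplicates and possibly pick
// different survivors on each action.
Dataset<Row> deduplicated = dataset
    .dropDuplicates(new String[]{"DATE", "TIME", "VEL", "COMPANY"})
    .persist(StorageLevel.MEMORY_AND_DISK());

deduplicated.count();  // forces materialization of the cached rows

Dataset<Row> filtered = deduplicated.filter(
    "jd > 120.85 and wd > 30.666666 and (status = 0 or status = 1)");
System.out.println(filtered.count());  // should now repeat the same value

Note that persist is only a hint: if cached partitions are evicted they get
recomputed, so writing the deduplicated data out (or checkpointing) would be
the stronger guarantee.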



> Dataset.count()  every time is different.
> -----------------------------------------
>
>                 Key: SPARK-25420
>                 URL: https://issues.apache.org/jira/browse/SPARK-25420
>             Project: Spark
>          Issue Type: Question
>          Components: Spark Core
>    Affects Versions: 2.3.0
>         Environment: spark2.3
> standalone
>            Reporter: huanghuai
>            Priority: Major
>              Labels: SQL
>
> Dataset<Row> dataset = sparkSession.read().format("csv")
>     .option("sep", ",")
>     .option("inferSchema", "true")
>     .option("escape", Constants.DEFAULT_CSV_ESCAPE)
>     .option("header", "true")
>     .option("encoding", "UTF-8")
>     .load("hdfs://192.168.1.26:9000/data/caopan/07-08_WithHead30M.csv");
> System.out.println("source count=" + dataset.count());
>
> Dataset<Row> dropDuplicates =
>     dataset.dropDuplicates(new String[]{"DATE", "TIME", "VEL", "COMPANY"});
> System.out.println("dropDuplicates count1=" + dropDuplicates.count());
> System.out.println("dropDuplicates count2=" + dropDuplicates.count());
>
> Dataset<Row> filter = dropDuplicates.filter(
>     "jd > 120.85 and wd > 30.666666 and (status = 0 or status = 1)");
> System.out.println("filter count1=" + filter.count());
> System.out.println("filter count2=" + filter.count());
> System.out.println("filter count3=" + filter.count());
> System.out.println("filter count4=" + filter.count());
> System.out.println("filter count5=" + filter.count());
>  
>  
> ---------------------------- The above is the code ----------------------------
>  
>  
> console output:
> source count=459275
> dropDuplicates count1=453987
> dropDuplicates count2=453987
> filter count1=445798
> filter count2=445797
> filter count3=445797
> filter count4=445798
> filter count5=445799
>  
> Question:
>
> Why is filter.count() different every time?
> If I remove dropDuplicates(), everything is OK!
>  


