[ https://issues.apache.org/jira/browse/SPARK-25420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16755832#comment-16755832 ]
Jeffrey edited comment on SPARK-25420 at 1/30/19 8:15 AM: ---------------------------------------------------------- Thanks [~zhiyin1233]. Setting aside whether this is a bug or a usage problem: is there a way to use dropDuplicates so that it consistently returns at least one record per group? In my case, that would be one record from Group A, one from Group B, and one from Group C. I still expect dropDuplicates to leave one record for each group: [https://www.mungingdata.com/apache-spark/deduplicating-and-collapsing] Sorry, I am still a rookie with Spark, but I cannot see the logic, or find anywhere in the Spark documentation that says, dropDuplicates may drop a whole group of duplicate records without leaving at least one of them. If that is really the intended behavior, I do not see in what practical cases this operation is useful. > Dataset.count() every time is different.
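(The usual way to guarantee exactly one surviving row per group is to rank the rows within each group by an explicit tie-breaking column and keep only rank 1; in Spark that is a window partitioned by the group columns with row_number(). Below is a minimal plain-Java sketch of that idea, not Spark itself; the Row fields and the seq tie-breaker are hypothetical.)

```java
import java.util.*;

// Sketch of "keep exactly one row per group, chosen deterministically".
// In Spark, the equivalent is: Window.partitionBy(groupCols).orderBy(seqCol),
// then filter row_number() == 1. Here we do it with plain collections.
public class OnePerGroup {
    static final class Row {
        final String group; final int seq; final String payload;
        Row(String group, int seq, String payload) {
            this.group = group; this.seq = seq; this.payload = payload;
        }
    }

    // Within each group, the row with the smallest seq survives,
    // regardless of the order in which the input arrives.
    static List<Row> onePerGroup(List<Row> rows) {
        Map<String, Row> best = new TreeMap<>();
        for (Row r : rows) {
            best.merge(r.group, r, (a, b) -> a.seq <= b.seq ? a : b);
        }
        return new ArrayList<>(best.values());
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
                new Row("A", 2, "a2"), new Row("A", 1, "a1"),
                new Row("B", 1, "b1"),
                new Row("C", 3, "c3"), new Row("C", 2, "c2"));
        // Every group keeps its lowest-seq row on every run:
        for (Row r : onePerGroup(rows)) {
            System.out.println(r.group + " -> " + r.payload);
        }
    }
}
```

Because the survivor is picked by an explicit ordering rather than by encounter order, repeated runs always return the same one record per group.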
> -----------------------------------------
>
>                 Key: SPARK-25420
>                 URL: https://issues.apache.org/jira/browse/SPARK-25420
>             Project: Spark
>          Issue Type: Question
>          Components: Spark Core
>    Affects Versions: 2.3.0
>         Environment: spark2.3
>                      standalone
>            Reporter: huanghuai
>            Priority: Major
>              Labels: SQL
>
> Dataset<Row> dataset = sparkSession.read().format("csv")
>         .option("sep", ",").option("inferSchema", "true")
>         .option("escape", Constants.DEFAULT_CSV_ESCAPE)
>         .option("header", "true").option("encoding", "UTF-8")
>         .load("hdfs://192.168.1.26:9000/data/caopan/07-08_WithHead30M.csv");
> System.out.println("source count=" + dataset.count());
>
> Dataset<Row> dropDuplicates = dataset.dropDuplicates(
>         new String[]{"DATE", "TIME", "VEL", "COMPANY"});
> System.out.println("dropDuplicates count1=" + dropDuplicates.count());
> System.out.println("dropDuplicates count2=" + dropDuplicates.count());
>
> Dataset<Row> filter = dropDuplicates.filter(
>         "jd > 120.85 and wd > 30.666666 and (status = 0 or status = 1)");
> System.out.println("filter count1=" + filter.count());
> System.out.println("filter count2=" + filter.count());
> System.out.println("filter count3=" + filter.count());
> System.out.println("filter count4=" + filter.count());
> System.out.println("filter count5=" + filter.count());
>
> ------------------------------ The above is the code ------------------------------
>
> Console output:
>
> source count=459275
> dropDuplicates count1=453987
> dropDuplicates count2=453987
> filter count1=445798
> filter count2=445797
> filter count3=445797
> filter count4=445798
> filter count5=445799
>
> Question:
>
> Why is filter.count() different every time?
> If I remove dropDuplicates(), everything is OK!
>
--
This message was sent by Atlassian JIRA (v7.6.3#76005)
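(For context on the observed drift: dropDuplicates keeps an arbitrary row per key, and Spark recomputes the plan on every action, so partition and task ordering can change which duplicate survives between one count() and the next. If the surviving rows differ in columns the later filter uses (here jd, wd, status), the filtered count varies. A plain-Java sketch of that mechanism, not Spark itself, with hypothetical Row fields:)

```java
import java.util.*;

// First-seen-wins dedup over the same data in two encounter orders,
// mimicking how dropDuplicates with no ordering can keep a different
// survivor on each recomputation.
public class UnstableDedupCount {
    static final class Row {
        final String key; final int status;
        Row(String key, int status) { this.key = key; this.status = status; }
    }

    // Keeps the first row seen for each key, like dropDuplicates.
    static Collection<Row> dedupFirstSeen(List<Row> rows) {
        Map<String, Row> byKey = new LinkedHashMap<>();
        for (Row r : rows) byKey.putIfAbsent(r.key, r);
        return byKey.values();
    }

    // A downstream filter on a non-key column, like "status = 0".
    static long countStatusZero(Collection<Row> rows) {
        long n = 0;
        for (Row r : rows) if (r.status == 0) n++;
        return n;
    }

    public static void main(String[] args) {
        // Two duplicates of key "k" that differ in a non-key column.
        Row statusZero = new Row("k", 0), statusOne = new Row("k", 1);
        // Same data, two encounter orders, as can happen across Spark jobs:
        long c1 = countStatusZero(dedupFirstSeen(Arrays.asList(statusZero, statusOne)));
        long c2 = countStatusZero(dedupFirstSeen(Arrays.asList(statusOne, statusZero)));
        System.out.println(c1 + " vs " + c2); // prints "1 vs 0"
    }
}
```

The dedup keeps exactly one row per key in both runs, yet the filtered count differs because a different duplicate survived, which matches the varying filter counts in the console output above.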