[ https://issues.apache.org/jira/browse/SPARK-25420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
huanghuai updated SPARK-25420:
------------------------------
    Priority: Trivial  (was: Major)

> Dataset.count() every time is different.
> -----------------------------------------
>
>                 Key: SPARK-25420
>                 URL: https://issues.apache.org/jira/browse/SPARK-25420
>             Project: Spark
>          Issue Type: Question
>          Components: Spark Core
>    Affects Versions: 2.3.0
>         Environment: spark2.3, standalone
>            Reporter: huanghuai
>            Priority: Trivial
>              Labels: SQL
>
> Dataset<Row> dataset = sparkSession.read().format("csv")
>         .option("sep", ",")
>         .option("inferSchema", "true")
>         .option("escape", Constants.DEFAULT_CSV_ESCAPE)
>         .option("header", "true")
>         .option("encoding", "UTF-8")
>         .load("hdfs://192.168.1.26:9000/data/caopan/07-08_WithHead30M.csv");
> System.out.println("source count=" + dataset.count());
>
> Dataset<Row> dropDuplicates = dataset.dropDuplicates(
>         new String[]{"DATE", "TIME", "VEL", "COMPANY"});
> System.out.println("dropDuplicates count1=" + dropDuplicates.count());
> System.out.println("dropDuplicates count2=" + dropDuplicates.count());
>
> Dataset<Row> filter = dropDuplicates.filter(
>         "jd > 120.85 and wd > 30.666666 and (status = 0 or status = 1)");
> System.out.println("filter count1=" + filter.count());
> System.out.println("filter count2=" + filter.count());
> System.out.println("filter count3=" + filter.count());
> System.out.println("filter count4=" + filter.count());
> System.out.println("filter count5=" + filter.count());
>
> ---------------- The above is the code ----------------
>
> console output:
> source count=459275
> dropDuplicates count1=453987
> dropDuplicates count2=453987
> filter count1=445798
> filter count2=445797
> filter count3=445797
> filter count4=445798
> filter count5=445799
>
> question:
>
> Why is filter.count() different every time?
> If I remove dropDuplicates(), everything is OK.
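
One possible explanation, not confirmed anywhere in this message: when dropDuplicates() is given only a subset of columns, which of the duplicate rows survives may depend on the order in which rows arrive after the shuffle, and each count() re-runs the whole plan because the Dataset is not cached. If rows sharing the same (DATE, TIME, VEL, COMPANY) differ in jd, wd, or status, the survivor could pass the filter in one run and fail it in the next. That would also be consistent with the dropDuplicates count staying fixed (the number of distinct keys does not change) while the filter count drifts by one or two. The plain-Java sketch below does not use Spark; it only simulates a "keep whichever duplicate arrives first" rule. The class DedupOrderSketch, the Row type, and the sample rows are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DedupOrderSketch {
    // Hypothetical row: 'key' stands in for (DATE, TIME, VEL, COMPANY);
    // 'status' is a column that is NOT part of the dedup key.
    static final class Row {
        final String key;
        final int status;
        Row(String key, int status) { this.key = key; this.status = status; }
    }

    // Keep the first row seen for each key (assumed "first arrival wins" rule).
    static List<Row> dropDuplicatesByKey(List<Row> rows) {
        Map<String, Row> firstSeen = new LinkedHashMap<>();
        for (Row r : rows) {
            firstSeen.putIfAbsent(r.key, r);
        }
        return new ArrayList<>(firstSeen.values());
    }

    // Dedup, then count rows matching the filter (status = 0 or status = 1).
    static long countAfterDedupAndFilter(List<Row> rows) {
        return dropDuplicatesByKey(rows).stream()
                .filter(r -> r.status == 0 || r.status == 1)
                .count();
    }

    public static void main(String[] args) {
        Row a = new Row("k1", 0); // passes the filter
        Row b = new Row("k1", 2); // same key as 'a', fails the filter
        Row c = new Row("k2", 1); // unique key, passes the filter

        // Same data, two arrival orders: a different k1 row survives each time.
        System.out.println(countAfterDedupAndFilter(Arrays.asList(a, b, c))); // prints 2
        System.out.println(countAfterDedupAndFilter(Arrays.asList(b, a, c))); // prints 1
    }
}
```

In both orders the dedup step returns two rows, yet the filtered count changes, which mirrors the pattern in the console output above. If this is the cause, checking the input for rows that share the four key columns but differ in jd, wd, or status would confirm it.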
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)