huanghuai created SPARK-25420:
---------------------------------

             Summary: Dataset.count()  every time is different.
                 Key: SPARK-25420
                 URL: https://issues.apache.org/jira/browse/SPARK-25420
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.3.0
         Environment: spark2.3 standalone
            Reporter: huanghuai


Code:

    // Read the CSV file with a header row and an inferred schema.
    Dataset<Row> dataset = sparkSession.read().format("csv")
            .option("sep", ",")
            .option("inferSchema", "true")
            .option("escape", Constants.DEFAULT_CSV_ESCAPE)
            .option("header", "true")
            .option("encoding", "UTF-8")
            .load("hdfs://192.168.1.26:9000/data/caopan/07-08_WithHead30M.csv");
    System.out.println("source count=" + dataset.count());

    // Deduplicate on a subset of the columns only.
    Dataset<Row> dropDuplicates = dataset.dropDuplicates(new String[]{"DATE", "TIME", "VEL", "COMPANY"});
    System.out.println("dropDuplicates count1=" + dropDuplicates.count());
    System.out.println("dropDuplicates count2=" + dropDuplicates.count());

    // Filter on columns that are not part of the deduplication key.
    Dataset<Row> filter = dropDuplicates.filter("jd > 120.85 and wd > 30.666666 and (status = 0 or status = 1)");
    System.out.println("filter count1=" + filter.count());
    System.out.println("filter count2=" + filter.count());
    System.out.println("filter count3=" + filter.count());
    System.out.println("filter count4=" + filter.count());
    System.out.println("filter count5=" + filter.count());

Console output:

    source count=459275
    dropDuplicates count1=453987
    dropDuplicates count2=453987
    filter count1=445798
    filter count2=445797
    filter count3=445797
    filter count4=445798
    filter count5=445799

Question: Why does filter.count() return a different value on every call? If I remove dropDuplicates(), the counts are stable.
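One plausible explanation, which is an assumption and not something confirmed in this report: dropDuplicates on a subset of columns keeps one row per key without promising which one, and each count() re-runs the whole plan because nothing is cached. The number of distinct keys stays fixed (matching the stable dropDuplicates counts above), but a filter on columns outside the key (jd, wd, status) can see a different surviving row on each run. The sketch below simulates this in plain Java without Spark; the class DedupOrderDemo, its Row type, and the sample data are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical simulation; this is not Spark's implementation.
final class DedupOrderDemo {

    /** Stand-in for a CSV row: one key column plus one column outside the key. */
    static final class Row {
        final String key;  // plays the role of (DATE, TIME, VEL, COMPANY)
        final int status;  // plays the role of a column the later filter reads
        Row(String key, int status) {
            this.key = key;
            this.status = status;
        }
    }

    /** Keeps the first row seen for each key, so the survivor depends on input order. */
    static List<Row> dropDuplicatesByKey(List<Row> rows) {
        Map<String, Row> firstPerKey = new LinkedHashMap<>();
        for (Row r : rows) {
            firstPerKey.putIfAbsent(r.key, r);
        }
        return new ArrayList<>(firstPerKey.values());
    }

    /** Counts rows with status 0 or 1, mirroring the status part of the reported filter. */
    static long countMatching(List<Row> rows) {
        return rows.stream().filter(r -> r.status == 0 || r.status == 1).count();
    }

    public static void main(String[] args) {
        // Same rows in two arrival orders; the two "A" rows differ only in status.
        List<Row> orderOne = Arrays.asList(new Row("A", 0), new Row("A", 2), new Row("B", 1));
        List<Row> orderTwo = Arrays.asList(new Row("A", 2), new Row("A", 0), new Row("B", 1));

        // The deduplicated size is the number of distinct keys in both cases.
        System.out.println("dedup size, order one = " + dropDuplicatesByKey(orderOne).size());
        System.out.println("dedup size, order two = " + dropDuplicatesByKey(orderTwo).size());

        // The filtered count depends on which duplicate survived.
        System.out.println("filter count, order one = " + countMatching(dropDuplicatesByKey(orderOne)));
        System.out.println("filter count, order two = " + countMatching(dropDuplicatesByKey(orderTwo)));
    }
}
```

If this is what happens here, calling persist() on the deduplicated Dataset before the repeated counts would likely make the numbers repeatable, because the filter would then read one materialized result instead of recomputing the deduplication. That has not been verified against this data set.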