[ https://issues.apache.org/jira/browse/SPARK-25420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16755762#comment-16755762 ]
huanghuai edited comment on SPARK-25420 at 1/30/19 7:13 AM:
------------------------------------------------------------

------------------------------- code to reproduce ------------------------------

SparkSession spark = SparkSession.builder()
        .appName("test")
        .master("local[2]")   // the original snippet also chained .master("local"), which would override local[2]
        .getOrCreate();
spark.conf().set("spark.sql.shuffle.partitions", 3);
spark.conf().set("spark.default.parallelism", 3);

List<Row> data1 = Arrays.asList(
        RowFactory.create("jack", 22, "gz"),
        RowFactory.create("lisi", 22, "sh"),
        RowFactory.create("wangwu", 22, "tj"),
        RowFactory.create("leo", 21, "alibaba"),
        RowFactory.create("a", 22, "sz"),
        RowFactory.create("b", 22, "sz1"),
        RowFactory.create("c", 22, "sz2"),
        RowFactory.create("d", 22, "sz3"),
        RowFactory.create("mic", 22, "bc")
);

StructType schema1 = new StructType(new StructField[] {
        new StructField("name", DataTypes.StringType, true, Metadata.empty()),
        new StructField("age", DataTypes.IntegerType, true, Metadata.empty()),
        new StructField("address", DataTypes.StringType, true, Metadata.empty())
});

for (int i = 0; i < 5; i++) {
    Dataset<Row> dropDuplicates = spark.createDataFrame(data1, schema1)
            .repartition(2) // **** with repartition(1), (2), or (3), you can see the difference in show() every time
            .dropDuplicates("age");
    dropDuplicates.show();
    dropDuplicates = dropDuplicates.filter("address like 'sz%'");
    System.out.println("count=" + dropDuplicates.count());
    // I found another problem: filter("address='sz2'").show() returns two records,
    // but filter("address='sz2'").count() returns 0.
}

--------------------------- This is not a bug in itself, just a usage question, or something else. ----------------------------------

was (Author: zhiyin1233): [previous revision; identical to the comment above]

> Dataset.count() is different every time.
> -----------------------------------------
>
>                 Key: SPARK-25420
>                 URL: https://issues.apache.org/jira/browse/SPARK-25420
>             Project: Spark
>          Issue Type: Question
>          Components: Spark Core
>    Affects Versions: 2.3.0
>         Environment: spark2.3 standalone
>            Reporter: huanghuai
>            Priority: Major
>              Labels: SQL
>
> Dataset<Row> dataset = sparkSession.read().format("csv")
>         .option("sep", ",")
>         .option("inferSchema", "true")
>         .option("escape", Constants.DEFAULT_CSV_ESCAPE)
>         .option("header", "true")
>         .option("encoding", "UTF-8")
>         .load("hdfs://192.168.1.26:9000/data/caopan/07-08_WithHead30M.csv");
> System.out.println("source count=" + dataset.count());
>
> Dataset<Row> dropDuplicates = dataset.dropDuplicates(new String[]{"DATE", "TIME", "VEL", "COMPANY"});
> System.out.println("dropDuplicates count1=" + dropDuplicates.count());
> System.out.println("dropDuplicates count2=" + dropDuplicates.count());
>
> Dataset<Row> filter = dropDuplicates.filter("jd > 120.85 and wd > 30.666666 and (status = 0 or status = 1)");
> System.out.println("filter count1=" + filter.count());
> System.out.println("filter count2=" + filter.count());
> System.out.println("filter count3=" + filter.count());
> System.out.println("filter count4=" + filter.count());
> System.out.println("filter count5=" + filter.count());
>
> ------------------------------------------------------ The above is the code ---------------------------------------
>
> console output:
> source count=459275
> dropDuplicates count1=453987
> dropDuplicates count2=453987
> filter 
count1=445798
> filter count2=445797
> filter count3=445797
> filter count4=445798
> filter count5=445799
>
> Question: why is filter.count() different every time?
> If I remove dropDuplicates(), everything is OK!

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
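For context on the behavior reported above, here is a minimal, Spark-free sketch (plain Java; class and method names are hypothetical, not Spark APIs) of why a "keep one row per key" dedup is order-sensitive: when the same rows arrive in a different order, as they can after a nondeterministic shuffle across runs, a different representative row survives for each key, so a later filter can match a different number of rows.

```java
import java.util.*;

public class DedupOrderDemo {
    // Keep the first row seen for each age -- analogous to dropDuplicates("age"),
    // which keeps an arbitrary row per key (whichever arrives first after the shuffle).
    static List<String[]> dedupByAge(List<String[]> rows) {
        Map<String, String[]> kept = new LinkedHashMap<>();
        for (String[] row : rows) {
            kept.putIfAbsent(row[1], row);  // row = {name, age, address}
        }
        return new ArrayList<>(kept.values());
    }

    // Analogous to filter("address like 'sz%'").count()
    static long countSzAddresses(List<String[]> rows) {
        return rows.stream().filter(r -> r[2].startsWith("sz")).count();
    }

    public static void main(String[] args) {
        List<String[]> run1 = Arrays.asList(
                new String[]{"jack", "22", "gz"},
                new String[]{"a", "22", "sz"},
                new String[]{"leo", "21", "alibaba"});
        // Same data, different arrival order (e.g. a different shuffle outcome):
        List<String[]> run2 = Arrays.asList(
                new String[]{"a", "22", "sz"},
                new String[]{"jack", "22", "gz"},
                new String[]{"leo", "21", "alibaba"});

        System.out.println(countSzAddresses(dedupByAge(run1))); // keeps "jack" for age 22 -> prints 0
        System.out.println(countSzAddresses(dedupByAge(run2))); // keeps "a" for age 22 -> prints 1
    }
}
```

One common way to make the downstream counts stable is to choose a deterministic representative per key instead of an arbitrary one, e.g. aggregating with groupBy plus min/max over the remaining columns, or persisting the deduplicated Dataset so the shuffle result is computed only once.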