[ https://issues.apache.org/jira/browse/SPARK-25420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16755762#comment-16755762 ]
huanghuai edited comment on SPARK-25420 at 1/30/19 7:13 AM:
------------------------------------------------------------

------------------------------- code to reproduce ------------------------------

SparkSession spark = SparkSession.builder()
        .appName("test")
        .master("local[2]")   // the original snippet also chained .master("local"), which would override local[2]
        .getOrCreate();
spark.conf().set("spark.sql.shuffle.partitions", 3);
spark.conf().set("spark.default.parallelism", 3);

List<Row> data1 = Arrays.asList(
        RowFactory.create("jack", 22, "gz"),
        RowFactory.create("lisi", 22, "sh"),
        RowFactory.create("wangwu", 22, "tj"),
        RowFactory.create("leo", 21, "alibaba"),
        RowFactory.create("a", 22, "sz"),
        RowFactory.create("b", 22, "sz1"),
        RowFactory.create("c", 22, "sz2"),
        RowFactory.create("d", 22, "sz3"),
        RowFactory.create("mic", 22, "bc")
);

StructType schema1 = new StructType(new StructField[] {
        new StructField("name", DataTypes.StringType, true, Metadata.empty()),
        new StructField("age", DataTypes.IntegerType, true, Metadata.empty()),
        new StructField("address", DataTypes.StringType, true, Metadata.empty())
});

for (int i = 0; i < 5; i++) {
    Dataset<Row> dropDuplicates = spark.createDataFrame(data1, schema1)
            .repartition(2) // **** with repartition(1), (2), or (3), you can see the difference in show() every time
            .dropDuplicates("age");
    dropDuplicates.show();
    dropDuplicates = dropDuplicates.filter("address like 'sz%'");
    System.out.println("count=" + dropDuplicates.count());
    // I found another problem: filter("address='sz2'").show() returns two records,
    // but filter("address='sz2'").count() returns 0.
}

--------------------------- This is not a bug in itself, just a usage question, or something else. ----------------------------------

was (Author: zhiyin1233): [previous revision; identical to the comment above]

> Dataset.count() is different every time.
> -----------------------------------------
>
>                 Key: SPARK-25420
>                 URL: https://issues.apache.org/jira/browse/SPARK-25420
>             Project: Spark
>          Issue Type: Question
>          Components: Spark Core
>    Affects Versions: 2.3.0
>         Environment: spark2.3 standalone
>            Reporter: huanghuai
>            Priority: Major
>              Labels: SQL
>
> Dataset<Row> dataset = sparkSession.read().format("csv")
>         .option("sep", ",")
>         .option("inferSchema", "true")
>         .option("escape", Constants.DEFAULT_CSV_ESCAPE)
>         .option("header", "true")
>         .option("encoding", "UTF-8")
>         .load("hdfs://192.168.1.26:9000/data/caopan/07-08_WithHead30M.csv");
> System.out.println("source count=" + dataset.count());
>
> Dataset<Row> dropDuplicates = dataset.dropDuplicates(new String[]{"DATE", "TIME", "VEL", "COMPANY"});
> System.out.println("dropDuplicates count1=" + dropDuplicates.count());
> System.out.println("dropDuplicates count2=" + dropDuplicates.count());
>
> Dataset<Row> filter = dropDuplicates.filter("jd > 120.85 and wd > 30.666666 and (status = 0 or status = 1)");
> System.out.println("filter count1=" + filter.count());
> System.out.println("filter count2=" + filter.count());
> System.out.println("filter count3=" + filter.count());
> System.out.println("filter count4=" + filter.count());
> System.out.println("filter count5=" + filter.count());
>
> ------------------------------------------------------ The above is the code ---------------------------------------
>
> console output:
> source count=459275
> dropDuplicates count1=453987
> dropDuplicates count2=453987
> filter 
count1=445798
> filter count2=445797
> filter count3=445797
> filter count4=445798
> filter count5=445799
>
> Question: why is filter.count() different every time?
> If I remove dropDuplicates(), everything is OK!

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
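For context on the behavior reported above, here is a minimal, Spark-free sketch (plain Java; class and method names are hypothetical, not Spark APIs) of why a "keep one row per key" dedup is order-sensitive: when the same rows arrive in a different order, as they can after a nondeterministic shuffle across runs, a different representative row survives for each key, so a later filter can match a different number of rows.

```java
import java.util.*;

public class DedupOrderDemo {
    // Keep the first row seen for each age -- analogous to dropDuplicates("age"),
    // which keeps an arbitrary row per key (whichever arrives first after the shuffle).
    static List<String[]> dedupByAge(List<String[]> rows) {
        Map<String, String[]> kept = new LinkedHashMap<>();
        for (String[] row : rows) {
            kept.putIfAbsent(row[1], row);  // row = {name, age, address}
        }
        return new ArrayList<>(kept.values());
    }

    // Analogous to filter("address like 'sz%'").count()
    static long countSzAddresses(List<String[]> rows) {
        return rows.stream().filter(r -> r[2].startsWith("sz")).count();
    }

    public static void main(String[] args) {
        List<String[]> run1 = Arrays.asList(
                new String[]{"jack", "22", "gz"},
                new String[]{"a", "22", "sz"},
                new String[]{"leo", "21", "alibaba"});
        // Same data, different arrival order (e.g. a different shuffle outcome):
        List<String[]> run2 = Arrays.asList(
                new String[]{"a", "22", "sz"},
                new String[]{"jack", "22", "gz"},
                new String[]{"leo", "21", "alibaba"});

        System.out.println(countSzAddresses(dedupByAge(run1))); // keeps "jack" for age 22 -> prints 0
        System.out.println(countSzAddresses(dedupByAge(run2))); // keeps "a" for age 22 -> prints 1
    }
}
```

One common way to make the downstream counts stable is to choose a deterministic representative per key instead of an arbitrary one, e.g. aggregating with groupBy plus min/max over the remaining columns, or persisting the deduplicated Dataset so the shuffle result is computed only once.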