[ https://issues.apache.org/jira/browse/SPARK-25420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16755837#comment-16755837 ]

Marco Gaido commented on SPARK-25420:
-------------------------------------

[~jeffrey.mak] I cannot reproduce your issue on the current master branch. I
created a test.csv file with the data you provided above and ran:

{code}
scala> val drkcard_0_df = spark.read.csv("test.csv")
drkcard_0_df: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string ... 4 more fields]

scala> drkcard_0_df.show()
+-------------------+-----------+---+---+--------------------+--------+
|                _c0|        _c1|_c2|_c3|                 _c4|     _c5|
+-------------------+-----------+---+---+--------------------+--------+
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  A|18091921273200820...|    John|
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  A|18091921273200820...|     Tom|
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  B|18091921273200820...|    Mary|
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  B|18091921273200820...|   Mabel|
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  C|18091921273200820...|   James|
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  C|18091921273200820...|Laurence|
+-------------------+-----------+---+---+--------------------+--------+

scala> val dropDup_0 = drkcard_0_df.dropDuplicates(Seq("_c0","_c1","_c2","_c3","_c4"))
dropDup_0: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [_c0: string, _c1: string ... 4 more fields]

scala> dropDup_0.show()
+-------------------+-----------+---+---+--------------------+-----+
|                _c0|        _c1|_c2|_c3|                 _c4|  _c5|
+-------------------+-----------+---+---+--------------------+-----+
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  C|18091921273200820...|James|
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  B|18091921273200820...| Mary|
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  A|18091921273200820...| John|
+-------------------+-----------+---+---+--------------------+-----+


scala> dropDup_0.where("_c0='2018-09-21 00:00:00' and _c1='TDT_DSC_ITM' and _c2='83' and _c4='1809192127320082002000018'").show()
+-------------------+-----------+---+---+--------------------+-----+
|                _c0|        _c1|_c2|_c3|                 _c4|  _c5|
+-------------------+-----------+---+---+--------------------+-----+
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  C|18091921273200820...|James|
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  B|18091921273200820...| Mary|
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  A|18091921273200820...| John|
+-------------------+-----------+---+---+--------------------+-----+


scala> dropDup_0.where("_c0='2018-09-21 00:00:00' and _c1='TDT_DSC_ITM' and _c2='83' and _c4='1809192127320082002000018'").show()
+-------------------+-----------+---+---+--------------------+-----+
|                _c0|        _c1|_c2|_c3|                 _c4|  _c5|
+-------------------+-----------+---+---+--------------------+-----+
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  C|18091921273200820...|James|
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  B|18091921273200820...| Mary|
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  A|18091921273200820...| John|
+-------------------+-----------+---+---+--------------------+-----+


// Re-running the same query three more times produced identical output on every run.
{code}
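
dropDuplicates keeps one arbitrary row per combination of the given columns, so which row survives can change between runs when the plan is recomputed and the input ordering is not stable. A minimal sketch of a deterministic alternative, assuming the same test.csv as above, is to rank the rows within each key by an explicit ordering and keep the first one:

{code}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

val df = spark.read.csv("test.csv")

// Order within each key by _c5 so the same row wins on every run.
val w = Window.partitionBy("_c0", "_c1", "_c2", "_c3", "_c4").orderBy("_c5")

val dedup = df
  .withColumn("rn", row_number().over(w))
  .where("rn = 1")
  .drop("rn")

dedup.show()
{code}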

> Dataset.count()  every time is different.
> -----------------------------------------
>
>                 Key: SPARK-25420
>                 URL: https://issues.apache.org/jira/browse/SPARK-25420
>             Project: Spark
>          Issue Type: Question
>          Components: Spark Core
>    Affects Versions: 2.3.0
>         Environment: spark2.3
> standalone
>            Reporter: huanghuai
>            Priority: Major
>              Labels: SQL
>
> Dataset<Row> dataset = sparkSession.read().format("csv")
>     .option("sep", ",").option("inferSchema", "true")
>     .option("escape", Constants.DEFAULT_CSV_ESCAPE).option("header", "true")
>     .option("encoding", "UTF-8")
>     .load("hdfs://192.168.1.26:9000/data/caopan/07-08_WithHead30M.csv");
> System.out.println("source count=" + dataset.count());
> Dataset<Row> dropDuplicates = dataset.dropDuplicates(new String[]{"DATE","TIME","VEL","COMPANY"});
> System.out.println("dropDuplicates count1="+dropDuplicates.count());
> System.out.println("dropDuplicates count2="+dropDuplicates.count());
> Dataset<Row> filter = dropDuplicates.filter("jd > 120.85 and wd > 30.666666 and (status = 0 or status = 1)");
> System.out.println("filter count1="+filter.count());
> System.out.println("filter count2="+filter.count());
> System.out.println("filter count3="+filter.count());
> System.out.println("filter count4="+filter.count());
> System.out.println("filter count5="+filter.count());
>  
>  
> ---------------------------------------- The above is the code ----------------------------------------
>  
>  
> console output:
> source count=459275
> dropDuplicates count1=453987
> dropDuplicates count2=453987
> filter count1=445798
> filter count2=445797
> filter count3=445797
> filter count4=445798
> filter count5=445799
>  
> Question:
>  
> Why is filter.count() different every time?
> If I remove dropDuplicates(), everything works fine.
>  
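
A common workaround for count drift like this is to persist the deduplicated Dataset before filtering: dropDuplicates makes an arbitrary choice of which row to keep per key, and persisting materializes that choice once instead of recomputing it for every count(). A minimal sketch, assuming the reporter's column names and file path and an active SparkSession named spark (if cached blocks are evicted and recomputed, the choice can still change, so this is a mitigation rather than a guarantee):

{code}
// Sketch only: the column names and HDFS path are taken from the
// reporter's snippet above; the escape option is omitted because
// Constants.DEFAULT_CSV_ESCAPE is the reporter's own constant.
val dataset = spark.read
  .format("csv")
  .option("sep", ",")
  .option("inferSchema", "true")
  .option("header", "true")
  .load("hdfs://192.168.1.26:9000/data/caopan/07-08_WithHead30M.csv")

// Persist so the arbitrary row kept per key by dropDuplicates is
// materialized once instead of being recomputed for every action.
val dropDuplicates = dataset
  .dropDuplicates("DATE", "TIME", "VEL", "COMPANY")
  .persist()

val filter = dropDuplicates.filter(
  "jd > 120.85 and wd > 30.666666 and (status = 0 or status = 1)")

// Repeated counts now agree because the deduplicated rows are fixed.
println(s"filter count1=${filter.count()}")
println(s"filter count2=${filter.count()}")
{code}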


