[ https://issues.apache.org/jira/browse/SPARK-25420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16755837#comment-16755837 ]
Marco Gaido commented on SPARK-25420:
-------------------------------------

[~jeffrey.mak] I cannot reproduce your issue on the current master branch. I created a test.csv file with the data you provided above and ran:

{code}
scala> val drkcard_0_df = spark.read.csv("test.csv")
drkcard_0_df: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string ... 4 more fields]

scala> drkcard_0_df.show()
+-------------------+-----------+---+---+--------------------+--------+
|                _c0|        _c1|_c2|_c3|                 _c4|     _c5|
+-------------------+-----------+---+---+--------------------+--------+
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  A|18091921273200820...|    John|
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  A|18091921273200820...|     Tom|
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  B|18091921273200820...|    Mary|
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  B|18091921273200820...|   Mabel|
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  C|18091921273200820...|   James|
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  C|18091921273200820...|Laurence|
+-------------------+-----------+---+---+--------------------+--------+

scala> val dropDup_0 = drkcard_0_df.dropDuplicates(Seq("_c0","_c1","_c2","_c3","_c4"))
dropDup_0: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [_c0: string, _c1: string ... 4 more fields]

scala> dropDup_0.show()
+-------------------+-----------+---+---+--------------------+-----+
|                _c0|        _c1|_c2|_c3|                 _c4|  _c5|
+-------------------+-----------+---+---+--------------------+-----+
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  C|18091921273200820...|James|
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  B|18091921273200820...| Mary|
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  A|18091921273200820...| John|
+-------------------+-----------+---+---+--------------------+-----+

scala> dropDup_0.where("_c0='2018-09-21 00:00:00' and _c1='TDT_DSC_ITM' and _c2='83' and _c4='1809192127320082002000018'").show()
+-------------------+-----------+---+---+--------------------+-----+
|                _c0|        _c1|_c2|_c3|                 _c4|  _c5|
+-------------------+-----------+---+---+--------------------+-----+
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  C|18091921273200820...|James|
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  B|18091921273200820...| Mary|
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  A|18091921273200820...| John|
+-------------------+-----------+---+---+--------------------+-----+

scala> dropDup_0.where("_c0='2018-09-21 00:00:00' and _c1='TDT_DSC_ITM' and _c2='83' and _c4='1809192127320082002000018'").show()
+-------------------+-----------+---+---+--------------------+-----+
|                _c0|        _c1|_c2|_c3|                 _c4|  _c5|
+-------------------+-----------+---+---+--------------------+-----+
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  C|18091921273200820...|James|
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  B|18091921273200820...| Mary|
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  A|18091921273200820...| John|
+-------------------+-----------+---+---+--------------------+-----+

scala> dropDup_0.where("_c0='2018-09-21 00:00:00' and _c1='TDT_DSC_ITM' and _c2='83' and _c4='1809192127320082002000018'").show()
+-------------------+-----------+---+---+--------------------+-----+
|                _c0|        _c1|_c2|_c3|                 _c4|  _c5|
+-------------------+-----------+---+---+--------------------+-----+
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  C|18091921273200820...|James|
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  B|18091921273200820...| Mary|
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  A|18091921273200820...| John|
+-------------------+-----------+---+---+--------------------+-----+

scala> dropDup_0.where("_c0='2018-09-21 00:00:00' and _c1='TDT_DSC_ITM' and _c2='83' and _c4='1809192127320082002000018'").show()
+-------------------+-----------+---+---+--------------------+-----+
|                _c0|        _c1|_c2|_c3|                 _c4|  _c5|
+-------------------+-----------+---+---+--------------------+-----+
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  C|18091921273200820...|James|
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  B|18091921273200820...| Mary|
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  A|18091921273200820...| John|
+-------------------+-----------+---+---+--------------------+-----+

scala> dropDup_0.where("_c0='2018-09-21 00:00:00' and _c1='TDT_DSC_ITM' and _c2='83' and _c4='1809192127320082002000018'").show()
+-------------------+-----------+---+---+--------------------+-----+
|                _c0|        _c1|_c2|_c3|                 _c4|  _c5|
+-------------------+-----------+---+---+--------------------+-----+
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  C|18091921273200820...|James|
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  B|18091921273200820...| Mary|
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  A|18091921273200820...| John|
+-------------------+-----------+---+---+--------------------+-----+
{code}
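For anyone who wants to repeat the same check without creating a local test.csv, here is a minimal self-contained variant of the session above. The rows mirror the sample data; since show() truncates _c4, the full value is assumed to be the one used in the WHERE clause.

{code}
// Minimal, self-contained sketch of the same check (paste into spark-shell).
import spark.implicits._

// Sample rows reconstructed from the truncated show() output above;
// the full _c4 value is assumed from the WHERE clause used in the session.
val drkcard_0_df = Seq(
  ("2018-09-21 00:00:00", "TDT_DSC_ITM", "83", "A", "1809192127320082002000018", "John"),
  ("2018-09-21 00:00:00", "TDT_DSC_ITM", "83", "A", "1809192127320082002000018", "Tom"),
  ("2018-09-21 00:00:00", "TDT_DSC_ITM", "83", "B", "1809192127320082002000018", "Mary"),
  ("2018-09-21 00:00:00", "TDT_DSC_ITM", "83", "B", "1809192127320082002000018", "Mabel"),
  ("2018-09-21 00:00:00", "TDT_DSC_ITM", "83", "C", "1809192127320082002000018", "James"),
  ("2018-09-21 00:00:00", "TDT_DSC_ITM", "83", "C", "1809192127320082002000018", "Laurence")
).toDF("_c0", "_c1", "_c2", "_c3", "_c4", "_c5")

val dropDup_0 = drkcard_0_df.dropDuplicates(Seq("_c0", "_c1", "_c2", "_c3", "_c4"))

// Re-evaluate the deduplicated, filtered result several times;
// for this data the count is expected to stay at 3 on every run.
(1 to 5).foreach { _ =>
  println(dropDup_0.where("_c4 = '1809192127320082002000018'").count())
}
{code}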
> Dataset.count() every time is different.
> -----------------------------------------
>
>                 Key: SPARK-25420
>                 URL: https://issues.apache.org/jira/browse/SPARK-25420
>             Project: Spark
>          Issue Type: Question
>          Components: Spark Core
>    Affects Versions: 2.3.0
>         Environment: spark2.3
>                      standalone
>            Reporter: huanghuai
>            Priority: Major
>              Labels: SQL
>
> Dataset<Row> dataset = sparkSession.read().format("csv")
>         .option("sep", ",").option("inferSchema", "true")
>         .option("escape", Constants.DEFAULT_CSV_ESCAPE).option("header", "true")
>         .option("encoding", "UTF-8")
>         .load("hdfs://192.168.1.26:9000/data/caopan/07-08_WithHead30M.csv");
> System.out.println("source count="+dataset.count());
>
> Dataset<Row> dropDuplicates = dataset.dropDuplicates(new String[]{"DATE","TIME","VEL","COMPANY"});
> System.out.println("dropDuplicates count1="+dropDuplicates.count());
> System.out.println("dropDuplicates count2="+dropDuplicates.count());
>
> Dataset<Row> filter = dropDuplicates.filter("jd > 120.85 and wd > 30.666666 and (status = 0 or status = 1)");
> System.out.println("filter count1="+filter.count());
> System.out.println("filter count2="+filter.count());
> System.out.println("filter count3="+filter.count());
> System.out.println("filter count4="+filter.count());
> System.out.println("filter count5="+filter.count());
>
> ------------------------------------------------------ The above is the code ---------------------------------------
>
> Console output:
>
> source count=459275
> dropDuplicates count1=453987
> dropDuplicates count2=453987
> filter count1=445798
> filter count2=445797
> filter count3=445797
> filter count4=445798
> filter count5=445799
>
> Question:
>
> Why is filter.count() different every time?
> If I remove dropDuplicates(), everything is OK!!
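One possible, unconfirmed explanation for the drifting counts in the quoted report: dropDuplicates() keeps one arbitrary row from each group of duplicates, so with keys DATE/TIME/VEL/COMPANY a later filter on the non-key columns jd, wd and status may match a different surviving row on each evaluation. A sketch of a deterministic alternative that picks the surviving row with an explicit ordering (the tie-break columns chosen here are arbitrary, and `dataset` stands for the DataFrame read from CSV in the report):

{code}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Sketch only: key columns are taken from the report; the orderBy below is an
// arbitrary but explicit tie-break, so the same row survives on every evaluation.
val w = Window
  .partitionBy("DATE", "TIME", "VEL", "COMPANY")
  .orderBy(col("jd"), col("wd"), col("status"))

val dedup = dataset
  .withColumn("rn", row_number().over(w))
  .where(col("rn") === 1)
  .drop("rn")

val filtered = dedup.filter("jd > 120.85 and wd > 30.666666 and (status = 0 or status = 1)")

// With the explicit ordering, repeated counts of the filtered result should agree.
println(filtered.count())
println(filtered.count())
{code}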