[ https://issues.apache.org/jira/browse/SPARK-25420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16755709#comment-16755709 ]
Jeffrey edited comment on SPARK-25420 at 1/30/19 5:59 AM:
----------------------------------------------------------

[~kabhwan] I cannot share the dataset since it is owned by my clients, but I can elaborate on the scenario:

>>drkcard_0_df = spark.read.csv("""[wasbs://e...@okprodstorage.blob.core.windows.net/AAA/WICTW/raw/TXN/POS/2018/09/**/*.gz]""")

The dataset carries more than 100,000 records.

>>drkcard_0_df.where("_c0='2018-09-21 00:00:00' and _c1='TDT_DSC_ITM' and _c2='83' and _c4='1809192127320082002000018'").show(2000,False)

|_c0|_c1|_c2|_c3|_c4|_c5|
|2018-09-21 00:00:00|TDT_DSC_ITM|83|A|1809192127320082002000018|John|
|2018-09-21 00:00:00|TDT_DSC_ITM|83|A|1809192127320082002000018|Tom|
|2018-09-21 00:00:00|TDT_DSC_ITM|83|B|1809192127320082002000018|Mary|
|2018-09-21 00:00:00|TDT_DSC_ITM|83|B|1809192127320082002000018|Mabel|
|2018-09-21 00:00:00|TDT_DSC_ITM|83|C|1809192127320082002000018|James|
|2018-09-21 00:00:00|TDT_DSC_ITM|83|C|1809192127320082002000018|Laurence|

>>dropDup_0 = drkcard_0_df.dropDuplicates(["_c0","_c1","_c2","_c3","_c4"])

Running the where for the 1st time: {color:#ff0000}Bug: group C was mistakenly dropped.{color}

>>dropDup_0.where("_c0='2018-09-21 00:00:00' and _c1='TDT_DSC_ITM' and _c2='83' and _c4='1809192127320082002000018'").show(2000,False)

|_c0|_c1|_c2|_c3|_c4|_c5|
|2018-09-21 00:00:00|TDT_DSC_ITM|83|A|1809192127320082002000018|John|
|2018-09-21 00:00:00|TDT_DSC_ITM|83|B|1809192127320082002000018|Mary|

Running the where again: {color:#ff0000}Acceptable result.{color}

>>dropDup_0.where("_c0='2018-09-21 00:00:00' and _c1='TDT_DSC_ITM' and _c2='83' and _c4='1809192127320082002000018'").show(2000,False)

|_c0|_c1|_c2|_c3|_c4|_c5|
|2018-09-21 00:00:00|TDT_DSC_ITM|83|A|1809192127320082002000018|John|
|2018-09-21 00:00:00|TDT_DSC_ITM|83|B|1809192127320082002000018|Mabel|
|2018-09-21 00:00:00|TDT_DSC_ITM|83|C|1809192127320082002000018|Laurence|

Running the where again: {color:#ff0000}Bug: groups A and C were mistakenly dropped.{color}
>>dropDup_0.where("_c0='2018-09-21 00:00:00' and _c1='TDT_DSC_ITM' and _c2='83' and _c4='1809192127320082002000018'").show(2000,False)

|_c0|_c1|_c2|_c3|_c4|_c5|
|2018-09-21 00:00:00|TDT_DSC_ITM|83|B|1809192127320082002000018|Mabel|

If I continue to repeat the where statement, the number of rows keeps changing.

> Dataset.count() every time is different.
> -----------------------------------------
>
>                 Key: SPARK-25420
>                 URL: https://issues.apache.org/jira/browse/SPARK-25420
>             Project: Spark
>          Issue Type: Question
>          Components: Spark Core
>    Affects Versions: 2.3.0
>        Environment: spark2.3
>                     standalone
>            Reporter: huanghuai
>            Priority: Major
>             Labels: SQL
>
> Dataset<Row> dataset = sparkSession.read().format("csv")
>         .option("sep", ",").option("inferSchema", "true")
>         .option("escape", Constants.DEFAULT_CSV_ESCAPE)
>         .option("header", "true")
>         .option("encoding", "UTF-8")
>         .load("hdfs://192.168.1.26:9000/data/caopan/07-08_WithHead30M.csv");
> System.out.println("source count=" + dataset.count());
>
> Dataset<Row> dropDuplicates = dataset.dropDuplicates(new String[]{"DATE", "TIME", "VEL", "COMPANY"});
> System.out.println("dropDuplicates count1=" + dropDuplicates.count());
> System.out.println("dropDuplicates count2=" + dropDuplicates.count());
>
> Dataset<Row> filter = dropDuplicates.filter("jd > 120.85 and wd > 30.666666 and (status = 0 or status = 1)");
> System.out.println("filter count1=" + filter.count());
> System.out.println("filter count2=" + filter.count());
> System.out.println("filter count3=" + filter.count());
> System.out.println("filter count4=" + filter.count());
> System.out.println("filter count5=" + filter.count());
> ------------------------------ The above is code ------------------------------
>
> Console output:
>
> source count=459275
> dropDuplicates count1=453987
> dropDuplicates count2=453987
> filter count1=445798
> filter count2=445797
> filter count3=445797
> filter count4=445798
> filter count5=445799
>
> Question: why is filter.count() different every time?
> If I remove dropDuplicates(), everything will be OK!

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
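For context on what is and is not a bug here: dropDuplicates only guarantees that *some* row survives per key group; *which* one is arbitrary and can depend on partition and arrival order. The following is a plain-Python model of that "first row seen wins" behaviour over synthetic rows mirroring the sample above (an illustration only, not Spark itself): the surviving row may legitimately differ between orderings, but every key group must still be present, and counts must be stable — losing whole groups and fluctuating counts, as reported in this thread, is the bug.

```python
# Plain-Python model of dropDuplicates' "keep some row per key" semantics.
# Synthetic data mirroring the sample rows above; not Spark itself.
rows = [
    ("A", "John"), ("A", "Tom"),
    ("B", "Mary"), ("B", "Mabel"),
    ("C", "James"), ("C", "Laurence"),
]

def first_wins(rows):
    """Keep the first row seen per key (models arbitrary-survivor dedup)."""
    kept = {}
    for key, name in rows:
        kept.setdefault(key, name)
    return kept

run1 = first_wins(rows)        # forward order
run2 = first_wins(rows[::-1])  # reversed order, as if partitions arrived differently

# The survivors may legitimately differ between "runs"...
print(run1)  # {'A': 'John', 'B': 'Mary', 'C': 'James'}
print(run2)  # {'C': 'Laurence', 'B': 'Mabel', 'A': 'Tom'}

# ...but the set of keys (and hence the count) must be identical.
print(sorted(run1) == sorted(run2) == ["A", "B", "C"])  # True
```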
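As a possible workaround while the underlying issue is open: make the dedup deterministic by choosing an explicit winner per key instead of an arbitrary one. In Spark this is commonly done with a Window plus row_number() ordered by a tiebreaker column, or by persisting the input DataFrame before dropDuplicates so every job sees the same rows. A plain-Python sketch of the explicit-winner idea, using the same synthetic rows as above and the lexicographically smallest _c5 as the (arbitrarily chosen) tiebreaker:

```python
from collections import defaultdict

# Deterministic dedup: pick an explicit winner per key (here the
# lexicographically smallest name) instead of whichever row shows up first.
rows = [
    ("A", "John"), ("A", "Tom"),
    ("B", "Mary"), ("B", "Mabel"),
    ("C", "James"), ("C", "Laurence"),
]

def dedup_min(rows):
    """Group rows by key, then keep the minimum value per group."""
    groups = defaultdict(list)
    for key, name in rows:
        groups[key].append(name)
    return {key: min(names) for key, names in groups.items()}

# Same answer no matter how the input is ordered.
print(dedup_min(rows) == dedup_min(rows[::-1]))  # True
print(dedup_min(rows))  # {'A': 'John', 'B': 'Mabel', 'C': 'James'}
```

The tiebreaker column and ordering are up to the job's semantics; the point is only that the chosen winner no longer depends on scan order.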