[ https://issues.apache.org/jira/browse/SPARK-26767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon updated SPARK-26767:
---------------------------------
    Priority: Major  (was: Blocker)

> Filter on a dropDuplicates dataframe gives inconsistency result
> ---------------------------------------------------------------
>
>                 Key: SPARK-26767
>                 URL: https://issues.apache.org/jira/browse/SPARK-26767
>             Project: Spark
>          Issue Type: Bug
>          Components: Build
>    Affects Versions: 2.3.0
>         Environment: To repeat the problem,
> (1) create a csv file with records holding the same values for a subset of
> columns (e.g. colA, colB, colC).
> (2) read the csv file as a Spark dataframe and then use dropDuplicates to
> dedup on that subset of columns (i.e. dropDuplicates(["colA", "colB", "colC"]))
> (3) filter the resulting dataframe with a where clause (i.e. df.where("colA =
> 'A' and colB='B' and colG='G' and colH='H'").show(100,False))
>
> => When (3) is rerun, it gives a different number of resulting rows.
>            Reporter: Jeffrey
>            Priority: Major
>
> To repeat the problem,
> (1) create a csv file with records holding the same values for a subset of
> columns (e.g. colA, colB, colC).
> (2) read the csv file as a Spark dataframe and then use dropDuplicates to
> dedup on that subset of columns (i.e. dropDuplicates(["colA", "colB", "colC"]))
> (3) filter the resulting dataframe with a where clause (i.e. df.where("colA =
> 'A' and colB='B' and colG='G' and colH='H'").show(100,False))
>
> => When (3) is rerun, it gives a different number of resulting rows.
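
The three reproduction steps in the issue could be sketched in PySpark roughly as follows. This is a minimal sketch under assumptions, not the reporter's script: the sample rows, the file layout, and the helper names `write_sample_csv` and `count_matches` are hypothetical, and the rows are made to differ in colG and colH only because those columns appear in the filter but not in the dedup subset. It compares `count()` across runs instead of calling `show(100, False)`, so the row counts can be checked programmatically. A file this small may well return the same count every time; the reporter's data is not included in the message.

```python
import csv

# Hypothetical sample data: all rows share (colA, colB, colC) but differ in
# colG / colH, the columns that are used only in the filter of step (3).
COLUMNS = ["colA", "colB", "colC", "colG", "colH"]
SAMPLE_ROWS = [
    {"colA": "A", "colB": "B", "colC": "C", "colG": "G", "colH": "H"},
    {"colA": "A", "colB": "B", "colC": "C", "colG": "G2", "colH": "H"},
    {"colA": "A", "colB": "B", "colC": "C", "colG": "G", "colH": "H2"},
]


def write_sample_csv(path):
    """Step (1): write a CSV whose rows agree on colA, colB and colC."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(SAMPLE_ROWS)
    return path


def count_matches(spark, path, runs=5):
    """Steps (2) and (3): dedup on a column subset, then filter repeatedly.

    `spark` is an existing SparkSession. Returns the row count seen on each
    run so that the runs can be compared with one another.
    """
    df = spark.read.csv(path, header=True)
    deduped = df.dropDuplicates(["colA", "colB", "colC"])
    condition = "colA = 'A' and colB = 'B' and colG = 'G' and colH = 'H'"
    return [deduped.where(condition).count() for _ in range(runs)]


# Usage, assuming PySpark is installed and a session is available:
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.master("local[*]").getOrCreate()
#   print(count_matches(spark, write_sample_csv("/tmp/sample.csv")))
```

If the counts in the returned list differ from one another, that matches the behaviour described in step (3) of the report.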