[ https://issues.apache.org/jira/browse/SPARK-27982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Karan Hebbar K S updated SPARK-27982:
-------------------------------------
    Summary: In Spark 2.2.1, a filter on a particular column followed by a drop of the same column fails to filter all the records  (was: In Spark 2.2.1, a filter on a particular column followed by the same column fails to filter all the records)

> In Spark 2.2.1, a filter on a particular column followed by a drop of the same
> column fails to filter all the records
> ----------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-27982
>                 URL: https://issues.apache.org/jira/browse/SPARK-27982
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.2.1
>            Reporter: Karan Hebbar K S
>            Priority: Minor
>              Labels: newbie
>   Original Estimate: 2m
>  Remaining Estimate: 2m
>
> The issue appears to follow from Spark's design: if a filter is applied on a
> column and the same column is then dropped, Spark filters only the first
> record and then drops the column, because both transformations (filter +
> drop) are narrow and are applied to each record as it is read.
>
> This results in only a few records being filtered, neglecting the rest.
>
> Here is the sample code:
>
> inserts_filtered = inserts.toDF().filter(col("op") == 'I')
> inserts_without_column_op = inserts_filtered.drop('op')
> inserts_without_column_op.repartition("partition_kerys").write.partitionBy("partition_kerys").mode("append").parquet(Path)
>
> The above lines of code write only one record with 'I' (the value of the
> column 'op'), neglecting the other records with 'I' in column 'op', as the
> column was dropped when the first record was filtered.
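[Editor's note: the expected semantics of filter-then-drop can be sketched with a plain-Python analogue of the reported pipeline. This is illustrative only, not PySpark; the `rows` data is hypothetical and loosely mirrors the sample CSV.]

```python
# Plain-Python sketch of the expected filter-then-drop semantics.
# The data below is hypothetical, loosely based on the sample CSV.
rows = [
    {"Op": "I", "key1": 1, "name": "xyz3"},
    {"Op": "I", "key1": 1, "name": "xyz2"},
    {"Op": "U", "key1": 4, "name": "xyz1"},
]

# Filter on the 'Op' column, then drop that column. The filter is
# evaluated against every record before the column is removed, so
# every matching record should survive, not just the first one.
filtered = [r for r in rows if r["Op"] == "I"]
without_op = [{k: v for k, v in r.items() if k != "Op"} for r in filtered]

print(len(without_op))        # 2 -- both 'I' records are kept
print("Op" in without_op[0])  # False -- the column is gone afterwards
```

If Spark returned only one of the matching records here, that would match the behavior the report describes.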
>
> Below is the sample record set in CSV that I am trying to convert to
> Parquet, writing with partition keys:
>
> Op,key1,key2,created_at,updated_at,name
> I,1,11,2017-02-04 12:34:14.000,2019-02-04 12:34:14.000,xyz3
> I,1,11,2017-02-04 12:34:14.000,2019-01-04 12:34:14.000,xyz2
> I,4,41,2018-02-04 12:01:14.000,2018-02-05 12:01:14.000,xyz1



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org