[ https://issues.apache.org/jira/browse/SPARK-27982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karan Hebbar K S updated SPARK-27982:
-------------------------------------
    Summary: In Spark 2.2.1, a filter on a particular column followed by the drop 
of the same column fails to filter all the records  (was: In Spark 2.2.1, a 
filter on a particular column followed by the same column fails to filter all 
the records)

> In Spark 2.2.1, a filter on a particular column followed by the drop of the 
> same column fails to filter all the records
> ----------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-27982
>                 URL: https://issues.apache.org/jira/browse/SPARK-27982
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.2.1
>            Reporter: Karan Hebbar K S
>            Priority: Minor
>              Labels: newbie
>   Original Estimate: 2m
>  Remaining Estimate: 2m
>
> The issue here appears to follow from the design of Spark: if a filter is 
> applied on a column and is followed by a drop of the same column, Spark 
> filters only the first record and then drops the column, because both 
> transformations (filter + drop) are narrow and are applied to each record as 
> it is read.
>  
> This results in only a few records being filtered, while the rest are 
> neglected.
>  
> Here is the sample code (PySpark; 'col' is pyspark.sql.functions.col):
> inserts_filtered = inserts.toDF().filter(col("op") == 'I')
> inserts_without_column_op = inserts_filtered.drop('op')
> inserts_without_column_op.repartition("partition_keys").write.partitionBy("partition_keys").mode("append").parquet(Path)
>  
> The above lines of code write only one record where the column 'op' has the 
> value 'I', neglecting the other records with 'I' in that column, because the 
> column was dropped once the first record had been filtered.
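>  
> One way to observe the reported behavior is to compare the filtered count 
> with the count actually written (a minimal sketch; 'spark' here is assumed 
> to be the active SparkSession, and 'Path' is the output path from the 
> snippet above):
>  
> print(inserts_filtered.count())     # number of records with op == 'I'
> written = spark.read.parquet(Path)  # read back what was actually written
> print(written.count())              # per this report, only 1 record is written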
>  
> Below is a sample of the records, in CSV, that are being converted to 
> Parquet and written with partition keys:
>  
>  Op,key1,key2,created_at,updated_at,name
>  I,1,11,2017-02-04 12:34:14.000,2019-02-04 12:34:14.000,xyz3
>  I,1,11,2017-02-04 12:34:14.000,2019-01-04 12:34:14.000,xyz2
>  I,4,41,2018-02-04 12:01:14.000,2018-02-05 12:01:14.000,xyz1
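>  
> For completeness, a self-contained reproduction sketch against the sample 
> above (assuming it is saved at /tmp/sample.csv; the paths and the choice of 
> 'key1' as the partition column are illustrative):
>  
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import col
>  
> spark = SparkSession.builder.appName("SPARK-27982-repro").getOrCreate()
>  
> # Read the sample CSV; note the header spells the column 'Op'
> df = spark.read.option("header", "true").csv("/tmp/sample.csv")
>  
> filtered = df.filter(col("Op") == 'I')  # all three sample rows match
> dropped = filtered.drop('Op')           # drop the filter column afterwards
>  
> dropped.repartition("key1").write.partitionBy("key1").mode("append").parquet("/tmp/out")
>  
> # Reading the output back should show 3 records; per this report it shows fewer
> print(spark.read.parquet("/tmp/out").count())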



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
