[ https://issues.apache.org/jira/browse/SPARK-23463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16369750#comment-16369750 ]

Manan Bakshi commented on SPARK-23463:
--------------------------------------

Hi Marco,

I realized that if you use the following code for the same sample:

# read the attached "sample" file; textFile yields raw strings
rdd = sc.textFile("sample").map(lambda x: x.split(", "))

# both columns are therefore inferred as string type
df = rdd.toDF(["dev", "val"])

df = df.filter(df["val"] > 0)
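
Since sc.textFile only hands back strings, both columns of this dataframe come out as string type, so the filter above ends up comparing a string column with an integer literal. A quick check (just a sketch, run against the same attachment) would be:

df.printSchema()
# expected to show something like:
# root
#  |-- dev: string (nullable = true)
#  |-- val: string (nullable = true)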

 

The filter condition actually filters out all values less than 1, even though it 
should only be filtering out values less than or equal to 0. This is really weird. 
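
For anyone without the attachment, here is a self-contained sketch that mirrors the sample table from the issue (the blank row is represented as an empty string, which is an assumption about the attachment's contents) and runs the same filters:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# rows mirror the sample table; both columns are plain strings
rows = [("ALL", "0.01"), ("ALL", "0.02"), ("ALL", "0.004"),
        ("ALL", ""), ("ALL", "2.5"), ("ALL", "4.5"), ("ALL", "45")]
df = spark.createDataFrame(rows, ["dev", "val"])

df.filter(df["val"] > 0).show()    # integer literal, as above
df.filter(df["val"] > 0.0).show()  # float literal, as in the issue description

# casting explicitly keeps the comparison numeric either way
df.filter(df["val"].cast("double") > 0).show()

The explicit cast at the end is just a workaround idea, not something from the original report.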

 

> Filter operation fails to handle blank values and evicts rows that even 
> satisfy the filtering condition
> -------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-23463
>                 URL: https://issues.apache.org/jira/browse/SPARK-23463
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.2.1
>            Reporter: Manan Bakshi
>            Priority: Critical
>         Attachments: sample
>
>
> Filter operations were updated in Spark 2.2.0: the Cost Based Optimizer was 
> introduced to look at table stats and decide filter selectivity. However, 
> since then, filter has started behaving unexpectedly for blank values. The 
> operation not only drops rows with blank values but also filters out rows 
> that actually meet the filter criteria.
> Steps to reproduce:
> Consider a simple dataframe with some blank values as below:
> ||dev||val||
> |ALL|0.01|
> |ALL|0.02|
> |ALL|0.004|
> |ALL| |
> |ALL|2.5|
> |ALL|4.5|
> |ALL|45|
> Running a simple filter operation over the val column in this dataframe 
> yields unexpected results. For example, the following query returned an 
> empty dataframe:
> df.filter(df["val"] > 0)
> ||dev||val||
> However, the filter operation works as expected if the 0 in the filter 
> condition is replaced by the float 0.0:
> df.filter(df["val"] > 0.0)
> ||dev||val||
> |ALL|0.01|
> |ALL|0.02|
> |ALL|0.004|
> |ALL|2.5|
> |ALL|4.5|
> |ALL|45|
>  
> Note that this bug only exists in Spark 2.2.0 and later. The previous 
> versions filter as expected for both int (0) and float (0.0) values in the 
> filter condition.
> Also, if there are no blank values, the filter operation works as expected 
> for all versions.


