[ https://issues.apache.org/jira/browse/SPARK-23463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16370917#comment-16370917 ]
Manan Bakshi commented on SPARK-23463:
--------------------------------------

Hi Marco,

That makes sense. However, this same code used to work fine in Spark 2.1.1 regardless of whether you compare against 0 or 0.0. Can you help me understand what changed?

> Filter operation fails to handle blank values and evicts rows that even
> satisfy the filtering condition
> -------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-23463
>                 URL: https://issues.apache.org/jira/browse/SPARK-23463
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.2.1
>            Reporter: Manan Bakshi
>            Priority: Critical
>         Attachments: sample
>
>
> Filter operations were updated in Spark 2.2.0. The Cost Based Optimizer was
> introduced to look at the table stats and decide filter selectivity. However,
> since then, filter has started behaving unexpectedly for blank values. The
> operation not only drops rows with blank values but also filters out
> rows that actually meet the filter criteria.
>
> Steps to reproduce:
>
> Consider a simple dataframe with some blank values, as below:
>
> ||dev||val||
> |ALL|0.01|
> |ALL|0.02|
> |ALL|0.004|
> |ALL| |
> |ALL|2.5|
> |ALL|4.5|
> |ALL|45|
>
> Running a simple filter operation over the val column in this dataframe yields
> unexpected results. For example, the following query returned an empty dataframe:
>
> df.filter(df["val"] > 0)
>
> ||dev||val||
>
> However, the filter operation works as expected if the 0 in the filter condition is
> replaced by the float 0.0:
>
> df.filter(df["val"] > 0.0)
>
> ||dev||val||
> |ALL|0.01|
> |ALL|0.02|
> |ALL|0.004|
> |ALL|2.5|
> |ALL|4.5|
> |ALL|45|
>
> Note that this bug only exists in Spark 2.2.0 and later. Previous
> versions filter as expected for both int (0) and float (0.0) values in the
> filter condition.
>
> Also, if there are no blank values, the filter operation works as expected
> for all versions.