Hyukjin Kwon created SPARK-9814:
-----------------------------------

             Summary: EqualNullSafe not passing to data sources
                 Key: SPARK-9814
                 URL: https://issues.apache.org/jira/browse/SPARK-9814
             Project: Spark
          Issue Type: Improvement
          Components: Input/Output
         Environment: Centos 6.6
            Reporter: Hyukjin Kwon
            Priority: Minor


When a data source (such as Parquet) filters data while reading from HDFS 
(rather than after loading it into memory), the physical planning phase passes 
filter objects from `org.apache.spark.sql.sources`, which are built and picked 
up by `selectFilters()` in `org.apache.spark.sql.sources.DataSourceStrategy`. 
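
For reference, this is a minimal sketch (e.g. in spark-shell) of the kind of stable, source-level filter object that `selectFilters()` produces for a simple equality predicate; the variable name is just for illustration:

```
import org.apache.spark.sql.sources

// For `WHERE field = 1`, selectFilters() produces a source-level filter
// object that any data source implementation can inspect:
val pushedDown: sources.Filter = sources.EqualTo("field", 1)

// For `WHERE field <=> 1` (EqualNullSafe) there is currently no such
// translation, so nothing reaches the data source and Spark falls back to
// filtering the rows after the scan.
```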

On the other hand, the `EqualNullSafe` expression in 
`org.apache.spark.sql.catalyst.expressions` is not passed down, even though it 
appears it could be pushed to data sources such as Parquet and JSON. More 
specifically, it is not passed to `buildScan()` (below) in `PrunedFilteredScan` 
and `PrunedScan`,

```
def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row]
```
even though the binary compatibility issue has been solved 
(https://issues.apache.org/jira/browse/SPARK-8747).
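
To make this concrete, a relation implementing `PrunedFilteredScan` only ever sees the translated `sources.Filter` objects, so a predicate that is never translated simply disappears from its view. A minimal sketch (the relation and column names are made up for illustration):

```
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Hypothetical relation that only reports which filters were pushed down.
class FilterDebugRelation(override val sqlContext: SQLContext)
  extends BaseRelation with PrunedFilteredScan {

  override val schema: StructType = StructType(StructField("field", IntegerType) :: Nil)

  override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
    // For `WHERE field = 1` this array contains EqualTo("field", 1);
    // for `WHERE field <=> 1` it is empty, so all rows come back unfiltered.
    filters.foreach(f => println(s"pushed down: $f"))
    sqlContext.sparkContext.emptyRDD[Row]
  }
}
```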

I understand that `CatalystScan` can receive all the raw expressions from the 
query planner. However, it is experimental, requires a different interface, and 
is unstable for reasons such as binary compatibility. 
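
For comparison, `CatalystScan` exposes roughly the following method, so raw expressions such as `EqualNullSafe` do reach it, but only through internal Catalyst types:

```
// org.apache.spark.sql.sources.CatalystScan (marked experimental)
def buildScan(requiredColumns: Seq[Attribute], filters: Seq[Expression]): RDD[Row]
```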


In general, this leads to the following problem. Compare these two queries:

1. 
```
SELECT * 
FROM table
WHERE field = 1;
```

2. 
```
SELECT * 
FROM table
WHERE field <=> 1;
```

The second query can be much slower even though it is almost identical in 
functionality, because the data that is not filtered at the source RDD incurs 
extra network traffic and other overhead.  





