Dear Sir / Madam,

I plan to contribute some code for passing filters to a data source during physical planning.

In more detail, I understand that when we want to push filter operations down to a data source such as Parquet (so that filtering happens while actually reading HDFS blocks, rather than in memory with Spark operations), we need to implement PrunedFilteredScan, PrunedScan, or CatalystScan in the package org.apache.spark.sql.sources.



PrunedFilteredScan and PrunedScan receive the filter objects defined in the package org.apache.spark.sql.sources. These do not come directly from the query parser; they are built by selectFilters() in org.apache.spark.sql.sources.DataSourceStrategy.

It looks like not all of the filters (or rather, the raw expressions) are passed to the function below in PrunedFilteredScan and PrunedScan:

def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row]

The filters passed here are the ones defined in the package org.apache.spark.sql.sources.
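To make this concrete, here is a standalone sketch (not Spark code; the names mirror the real interfaces, but the types are simplified stand-ins I made up for illustration) of how a PrunedFilteredScan-style source only ever sees already-translated Filter objects, never the raw Catalyst expressions:

```scala
// Simplified stand-ins for org.apache.spark.sql.sources.Filter and its
// subclasses. These are illustrations only, not the real Spark classes.
sealed trait Filter
case class EqualTo(attribute: String, value: Any) extends Filter
case class GreaterThan(attribute: String, value: Any) extends Filter

// A buildScan-like method over plain Scala collections: the data source
// receives only Filter objects and can apply them while "reading".
def buildScan(requiredColumns: Array[String],
              filters: Array[Filter],
              data: Seq[Map[String, Any]]): Seq[Map[String, Any]] = {
  def matches(row: Map[String, Any], f: Filter): Boolean = f match {
    case EqualTo(attr, v) => row.get(attr).contains(v)
    case GreaterThan(attr, v) => (row.get(attr), v) match {
      case (Some(a: Int), b: Int) => a > b
      case _                      => false
    }
  }
  data.filter(row => filters.forall(matches(row, _)))
      .map(row => requiredColumns.map(c => c -> row(c)).toMap)
}
```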

On the other hand, the EqualNullSafe expression in the package org.apache.spark.sql.catalyst.expressions is not passed down, even though it looks possible to push it down for data sources such as Parquet and JSON.
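To sketch where the predicate is dropped (again a standalone illustration patterned after what selectFilters() does, using hypothetical mini types of my own, not the real Catalyst classes): expressions without a matching translation case fall through to None, so the data source never sees them.

```scala
// Hypothetical mini-model of Catalyst expressions (illustration only).
sealed trait Expr
case class EqualToExpr(attr: String, value: Any) extends Expr
case class EqualNullSafeExpr(attr: String, value: Any) extends Expr

// Mini-model of the source-side filters in org.apache.spark.sql.sources.
sealed trait SourceFilter
case class SourceEqualTo(attr: String, value: Any) extends SourceFilter

// Patterned after selectFilters(): an expression with no matching case
// is simply not translated, so it is never pushed down to the source.
def translate(e: Expr): Option[SourceFilter] = e match {
  case EqualToExpr(a, v) => Some(SourceEqualTo(a, v))
  case _                 => None // EqualNullSafeExpr falls through here
}
```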



I understand that CatalystScan can take all the raw expressions coming from the query planner. However, it is experimental, requires a different interface, and is also unstable for reasons such as binary compatibility.

As far as I know, the Parquet data source does not use CatalystScan either.



In general, this can be an issue when a user sends a query against the data, such as:

1.

SELECT *
FROM table
WHERE field = 1;


2.

SELECT *
FROM table
WHERE field <=> 1;


The second query can be hugely slow, because the data is not filtered at the source and the unfiltered source RDD causes large network traffic.
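For reference, the only semantic difference between the two operators concerns NULL handling. Sketched here with Scala's Option standing in for a nullable column (an illustration of the SQL semantics, with names of my own, not Spark code): `=` yields NULL when either side is NULL, while `<=>` is null-safe and treats two NULLs as equal.

```scala
// '=' semantics: the result is NULL (here None) if either operand is NULL.
def sqlEq(a: Option[Int], b: Option[Int]): Option[Boolean] =
  for (x <- a; y <- b) yield x == y

// '<=>' (null-safe equality) semantics: always a Boolean,
// and NULL <=> NULL evaluates to true.
def sqlEqNullSafe(a: Option[Int], b: Option[Int]): Boolean =
  (a, b) match {
    case (Some(x), Some(y)) => x == y
    case (None, None)       => true
    case _                  => false
  }
```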



Also, I could not find an appropriate JIRA issue for this (except for
https://issues.apache.org/jira/browse/SPARK-8747, which says it now
supports binary compatibility).

Accordingly, I would like to file an issue for this and open a pull request with my code.


Could you please comment on this?

Thanks.
