I think you can create a new issue and send a pull request for it. + dev list
Thanks
Best Regards

On Tue, Aug 11, 2015 at 8:32 AM, Hyukjin Kwon <gurwls...@gmail.com> wrote:

> Dear Sir / Madam,
>
> I plan to contribute some code for passing filters to a data source
> during physical planning.
>
> In more detail, I understand that when we want to push filter operations
> down to a source such as Parquet (so that filtering happens while the HDFS
> blocks are actually read, rather than in memory with Spark operations
> afterwards), we need to implement PrunedFilteredScan, PrunedScan or
> CatalystScan in package org.apache.spark.sql.sources.
>
> For PrunedFilteredScan and PrunedScan, the filters passed are objects from
> package org.apache.spark.sql.sources, which do not access the query parser
> directly but are built by selectFilters() in
> org.apache.spark.sql.sources.DataSourceStrategy.
>
> It looks like not all of the filters (or rather, raw expressions) are
> passed to the function below in PrunedFilteredScan and PrunedScan:
>
> def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row]
>
> The filters passed here are the ones defined in package
> org.apache.spark.sql.sources.
>
> In particular, the EqualNullSafe filter in package
> org.apache.spark.sql.catalyst.expressions is not passed, even though it
> looks possible to pass it for data sources such as Parquet and JSON.
>
> I understand that CatalystScan can take all of the raw expressions by
> accessing the query planner. However, it is experimental, it needs
> different interfaces, and it is unstable for reasons such as binary
> compatibility.
>
> As far as I know, Parquet does not use it either.
>
> In general, this can be an issue when a user sends queries such as the
> following:
>
> 1.
>
> SELECT *
> FROM table
> WHERE field = 1;
>
> 2.
>
> SELECT *
> FROM table
> WHERE field <=> 1;
>
> The second query can be hugely slow because the source RDD returns
> unfiltered data, which causes heavy network traffic.
>
> Also, I could not find a proper issue for this (except for
> https://issues.apache.org/jira/browse/SPARK-8747), which says it now
> supports binary capability.
>
> Accordingly, I want to file this issue and make a pull request with my
> code.
>
> Could you please give me any comments on this?
>
> Thanks.
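
For anyone following the thread, the gap described above can be illustrated with a minimal, self-contained sketch. It does not use Spark's real classes: `Expr`, `SourceFilter`, `SrcEqualTo`, `SrcGreaterThan`, and this `selectFilters` are all hypothetical stand-ins, and the sketch only assumes that the translation step has no case for `EqualNullSafe`, as the quoted message reports.

```scala
// Simplified stand-ins for illustration only; these are NOT Spark's classes.
object PushdownSketch {
  // Planner-side predicates, reduced to "attribute compared with a literal".
  sealed trait Expr
  final case class EqualTo(attr: String, value: Any) extends Expr
  final case class EqualNullSafe(attr: String, value: Any) extends Expr
  final case class GreaterThan(attr: String, value: Any) extends Expr

  // Source-side filters, i.e. what a buildScan-style method would receive.
  sealed trait SourceFilter
  final case class SrcEqualTo(attr: String, value: Any) extends SourceFilter
  final case class SrcGreaterThan(attr: String, value: Any) extends SourceFilter

  // Hypothetical translation step: only predicates with a matching case are
  // pushed down. EqualNullSafe has no case here, so it is silently skipped
  // and would have to be evaluated after the scan.
  def selectFilters(predicates: Seq[Expr]): Seq[SourceFilter] =
    predicates.collect {
      case EqualTo(a, v)     => SrcEqualTo(a, v)
      case GreaterThan(a, v) => SrcGreaterThan(a, v)
    }

  def main(args: Array[String]): Unit = {
    // WHERE field = 1: one filter reaches the source.
    println(selectFilters(Seq(EqualTo("field", 1))))
    // WHERE field <=> 1: nothing reaches the source.
    println(selectFilters(Seq(EqualNullSafe("field", 1))))
  }
}
```

Under these assumptions, the first query lets the source skip non-matching rows, while the second makes it return every row. Adding a source-side counterpart for `EqualNullSafe` plus a matching case in the translation step would close the gap in this sketch.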