Hi! First time poster, long time reader.

I'm wondering if there is a way to let Catalyst know that it doesn't need
to repeat a filter on the Spark side after that filter has already been
applied by a source implementing PrunedFilteredScan.


This is for a use case in which we accept a filter on a non-existent column
that serves as an entry point for our integration with a different system.
While the source can handle this correctly, the secondary filter applied to
the RDD itself wipes out the results, because the column being filtered on
does not exist.

In particular, this comes up in our integration with Solr, where we allow
users to pass in a predicate on a pseudo-column "solr_query", a la
(where solr_query = '*:*'). There is no actual column "solr_query", so the
re-applied rdd.filter(row.solr_query == "*:*") filters out all of the data,
since no rows will ever have that column.
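To make this concrete, here is a minimal sketch of the kind of relation we
have (class and column names are illustrative, not our actual connector; the
Solr call itself is stubbed out with placeholder rows):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, EqualTo, Filter, PrunedFilteredScan}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

class SolrRelation(@transient val sqlContext: SQLContext)
    extends BaseRelation with PrunedFilteredScan {

  override def schema: StructType =
    StructType(Seq(StructField("id", StringType)))

  override def buildScan(requiredColumns: Array[String],
                         filters: Array[Filter]): RDD[Row] = {
    // Extract the pseudo-column predicate and push it down to Solr.
    val solrQuery = filters.collectFirst {
      case EqualTo("solr_query", q: String) => q
    }.getOrElse("*:*")
    // Real implementation would run solrQuery against Solr; stubbed here.
    // The problem: Catalyst still wraps this RDD in its own Filter node for
    // solr_query, and since no Row carries that column, every row is dropped.
    sqlContext.sparkContext.parallelize(Seq(Row("doc1"), Row("doc2")))
  }
}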

I'm thinking about a few solutions to this, but they all seem a little hacky:
1) Try to manually remove the filter step from the query plan after our
source handles the filter
2) Populate the solr_query field in the returned rows so they all
automatically pass (sketched below)
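
For option 2, building on the sketch above, the idea would be to declare a
dummy "solr_query" column and echo the pushed-down predicate value back on
every row, so Catalyst's re-applied equality filter trivially passes. The
fetchIdsFromSolr helper is hypothetical, and column pruning is omitted for
brevity:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, EqualTo, Filter, PrunedFilteredScan}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

class SolrRelationWorkaround(@transient val sqlContext: SQLContext,
                             fetchIdsFromSolr: String => RDD[String])
    extends BaseRelation with PrunedFilteredScan {

  override def schema: StructType = StructType(Seq(
    StructField("id", StringType),
    StructField("solr_query", StringType)))  // dummy pass-through column

  override def buildScan(requiredColumns: Array[String],
                         filters: Array[Filter]): RDD[Row] = {
    val solrQuery = filters.collectFirst {
      case EqualTo("solr_query", q: String) => q
    }.getOrElse("*:*")
    // Echo the literal query string back so row("solr_query") == solrQuery
    // and the redundant Catalyst filter keeps every row.
    fetchIdsFromSolr(solrQuery).map(id => Row(id, solrQuery))
  }
}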

But I think the real solution is to add a way to create a PrunedFilteredScan
which does not reapply filters if the source doesn't want it to, i.e. giving
PrunedFilteredScan the ability to trust the underlying source that the filter
will be accurately applied. Maybe changing the API to something like

PrunedFilteredScan(requiredColumns: Array[String], filters: Array[Filter],
reapply: Boolean = true)

where Catalyst would check the reapply value and skip adding the RDD.filter
step if it is false.
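
One way the flag could be surfaced (just a sketch; the trait name and member
are my invention, and I've moved the flag from the method signature to a
member so that the source, rather than the caller, decides):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.Filter

// Rough sketch of the proposed trait, not the actual Spark API. The planner
// would consult reapplyFilters before wrapping the scan in its own Filter node.
trait TrustingPrunedFilteredScan {
  def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row]

  // Default true keeps today's defensive behavior; a source like our Solr
  // relation would override this to false after consuming solr_query itself.
  def reapplyFilters: Boolean = true
}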

Thoughts?

Thanks for your time,
Russ
