We have to try to maintain binary compatibility here, so the easiest thing to do would probably be to add a method to the class. Perhaps something like:

    def unhandledFilters(filters: Array[Filter]): Array[Filter] = filters

By default, this could return all filters so behavior would remain the same, but specific implementations could override it. There is still a chance that this would conflict with existing methods, but hopefully that would not be a problem in practice.

Thoughts?

Michael

On Fri, Sep 25, 2015 at 10:02 PM, Russell Spitzer <russell.spit...@gmail.com> wrote:

> Hi! First time poster, long time reader.
>
> I'm wondering if there is a way to let Catalyst know that it doesn't need
> to repeat a filter on the Spark side after a filter has been applied by the
> source implementing PrunedFilterScan.
>
> This is for a use case in which we accept a filter on a nonexistent column
> that serves as an entry point for our integration with a different system.
> While the source can correctly deal with this, the secondary filter done on
> the RDD itself wipes out the results because the column being filtered does
> not exist.
>
> In particular, this is with our integration with Solr, where we allow users
> to pass in a predicate based on "solr_query" (e.g. "where solr_query='*:*'").
> There is no column "solr_query", so the rdd.filter(row.solr_query == "*:*")
> filters out all of the data, since no rows will have that column.
>
> I'm thinking about a few solutions to this, but they all seem a little hacky:
> 1) Try to manually remove the filter step from the query plan after our
> source handles the filter
> 2) Populate the solr_query field being returned so they all automatically
> pass
>
> But I think the real solution is to add a way to create a PrunedFilterScan
> which does not reapply filters if the source doesn't want it to, i.e. giving
> PrunedFilterScan the ability to trust the underlying source that the filter
> will be accurately applied.
> Maybe changing the API to
>
> PrunedFilterScan(requiredColumns: Array[String], filters: Array[Filter],
> reapply: Boolean = true)
>
> where Catalyst can check the reapply value and not add an RDD.filter if it
> is false.
>
> Thoughts?
>
> Thanks for your time,
> Russ
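
For illustration, here is a rough, self-contained sketch of how the unhandledFilters idea from the reply could play out for the solr_query case. It does not use Spark itself: Relation, SolrLikeRelation, Row, matches, execute and the cut-down Filter/EqualTo types are all hypothetical stand-ins invented for this example, and execute only approximates what a planner might do with the returned filters.

```scala
object UnhandledFiltersSketch {
  // Simplified stand-ins for the data source filter types (hypothetical,
  // not the real Spark classes).
  sealed trait Filter
  final case class EqualTo(attribute: String, value: Any) extends Filter

  type Row = Map[String, Any]

  abstract class Relation {
    def buildScan(requiredColumns: Array[String], filters: Array[Filter]): Seq[Row]

    // Default: report every filter as unhandled, so all of them are
    // re-applied after the scan and existing behavior is unchanged.
    def unhandledFilters(filters: Array[Filter]): Array[Filter] = filters
  }

  // A Solr-like source that claims full responsibility for the
  // pseudo-column "solr_query" (assumption for this sketch).
  class SolrLikeRelation(data: Seq[Row]) extends Relation {
    private def handledBySource(f: Filter): Boolean = f match {
      case EqualTo("solr_query", _) => true
      case _                        => false
    }

    // A real source would push the solr_query predicate down to Solr.
    // This sketch ignores the filters and only prunes columns.
    def buildScan(requiredColumns: Array[String], filters: Array[Filter]): Seq[Row] =
      data.map(row => row.filter { case (k, _) => requiredColumns.contains(k) })

    // Only filters the source did not handle are handed back.
    override def unhandledFilters(filters: Array[Filter]): Array[Filter] =
      filters.filterNot(handledBySource)
  }

  // Evaluate a filter against a row; a missing column never matches.
  def matches(row: Row, f: Filter): Boolean = f match {
    case EqualTo(attr, value) => row.get(attr).contains(value)
  }

  // Approximation of the planner: scan, then re-apply only the filters
  // that the relation reports as unhandled.
  def execute(rel: Relation, columns: Array[String], filters: Array[Filter]): Seq[Row] = {
    val scanned = rel.buildScan(columns, filters)
    val remaining = rel.unhandledFilters(filters)
    scanned.filter(row => remaining.forall(matches(row, _)))
  }
}
```

With a relation that keeps the default unhandledFilters, a solr_query = '*:*' predicate is re-applied after the scan and drops every row, which is the problem described in the original message. With SolrLikeRelation, that predicate is excluded from the re-applied set while any other filters are still checked on the Spark side.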