That sounds fine to me. We already do the filtering, so populating that field would be pretty simple.
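
To make the proposal below concrete, here is a rough sketch of how a source might override the suggested `unhandledFilters` hook. It is only an illustration under assumptions: `Filter`, `EqualTo`, `GreaterThan`, and `BaseRelation` are simplified stand-ins defined here so the snippet runs without Spark, and `SolrRelation` and `handledBySource` are hypothetical names, not code from our connector.

```scala
// Stand-ins for Spark's data source filter types (assumption: simplified so
// the sketch compiles and runs without a Spark dependency).
sealed trait Filter
case class EqualTo(attribute: String, value: Any) extends Filter
case class GreaterThan(attribute: String, value: Any) extends Filter

// Stand-in base class carrying the hook proposed in the thread.
abstract class BaseRelation {
  // Default: report every filter as unhandled, so existing sources keep
  // today's behavior and the engine re-applies all filters.
  def unhandledFilters(filters: Array[Filter]): Array[Filter] = filters
}

// Hypothetical Solr-backed relation that fully handles the synthetic
// "solr_query" predicate itself.
class SolrRelation extends BaseRelation {
  // True for filters the source guarantees it has already applied.
  private def handledBySource(f: Filter): Boolean = f match {
    case EqualTo("solr_query", _) => true
    case _                        => false
  }

  // Return only the filters the engine still has to evaluate.
  override def unhandledFilters(filters: Array[Filter]): Array[Filter] =
    filters.filterNot(handledBySource)
}

object UnhandledFiltersSketch {
  def main(args: Array[String]): Unit = {
    val pushed: Array[Filter] =
      Array(EqualTo("solr_query", "*:*"), GreaterThan("age", 21))
    val remaining = new SolrRelation().unhandledFilters(pushed)
    // Only the filter on "age" is left for the engine to re-apply.
    println(remaining.mkString(", "))
  }
}
```

With something like this, the predicate on the nonexistent `solr_query` column would never be re-evaluated against the returned rows, while any filter the source does not claim is still applied as it is today.
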
On Sun, Sep 27, 2015 at 2:08 PM Michael Armbrust <mich...@databricks.com> wrote:

> We have to try and maintain binary compatibility here, so probably the
> easiest thing to do here would be to add a method to the class. Perhaps
> something like:
>
> def unhandledFilters(filters: Array[Filter]): Array[Filter] = filters
>
> By default, this could return all filters so behavior would remain the
> same, but specific implementations could override it. There is still a
> chance that this would conflict with existing methods, but hopefully that
> would not be a problem in practice.
>
> Thoughts?
>
> Michael
>
> On Fri, Sep 25, 2015 at 10:02 PM, Russell Spitzer <
> russell.spit...@gmail.com> wrote:
>
>> Hi! First time poster, long time reader.
>>
>> I'm wondering if there is a way to let Catalyst know that it doesn't need
>> to repeat a filter on the Spark side after the filter has been applied by
>> the source implementing PrunedFilteredScan.
>>
>> This is for a use case in which we accept a filter on a nonexistent
>> column that serves as an entry point for our integration with a different
>> system. While the source can correctly deal with this, the secondary filter
>> done on the RDD itself wipes out the results because the column being
>> filtered does not exist.
>>
>> In particular, this is with our integration with Solr, where we allow users
>> to pass in a predicate based on "solr_query", a la (where solr_query='*:*').
>> There is no column "solr_query", so the rdd.filter(row.solr_query == "*:*")
>> filters out all of the data, since no rows will have that column.
>>
>> I'm thinking about a few solutions to this, but they all seem a little
>> hacky:
>> 1) Try to manually remove the filter step from the query plan after our
>> source handles the filter
>> 2) Populate the solr_query field being returned so that all rows
>> automatically pass
>>
>> But I think the real solution is to add a way to create a
>> PrunedFilteredScan which does not reapply filters if the source doesn't
>> want it to, i.e. giving PrunedFilteredScan the ability to trust the
>> underlying source that the filter will be accurately applied. Maybe
>> changing the API to
>>
>> PrunedFilteredScan(requiredColumns: Array[String], filters: Array[Filter],
>> reapply: Boolean = true)
>>
>> where Catalyst can check the reapply value and not add an RDD.filter if
>> it is false.
>>
>> Thoughts?
>>
>> Thanks for your time,
>> Russ