Should I open a new ticket for this, or is there something already underway?
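For concreteness, here is a minimal self-contained sketch of the unhandledFilters hook Michael proposes below. The Filter, EqualTo, and BaseRelation types here are simplified stand-ins for the real org.apache.spark.sql.sources classes, and SolrRelation is hypothetical; this is just a sketch of how a source could claim a synthetic predicate, not the actual Spark or connector implementation:

```scala
// Simplified stand-ins for org.apache.spark.sql.sources.Filter and friends.
sealed trait Filter
case class EqualTo(attribute: String, value: Any) extends Filter

trait BaseRelation {
  // Default: report every filter as unhandled, so Catalyst keeps
  // re-applying them and existing sources keep their current behavior.
  def unhandledFilters(filters: Array[Filter]): Array[Filter] = filters
}

// A hypothetical Solr-backed source claims the synthetic "solr_query"
// predicate so Catalyst would not re-filter on the non-existent column.
class SolrRelation extends BaseRelation {
  override def unhandledFilters(filters: Array[Filter]): Array[Filter] =
    filters.filterNot {
      case EqualTo("solr_query", _) => true
      case _                        => false
    }
}

val pushed: Array[Filter] =
  Array(EqualTo("solr_query", "*:*"), EqualTo("name", "russ"))

// Only the "name" filter is reported back as unhandled.
println(new SolrRelation().unhandledFilters(pushed).mkString(", "))
```

Because the default implementation returns its input unchanged, existing sources compile and behave exactly as before, which is what makes the method binary-compatibility-friendly.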
On Mon, Oct 5, 2015 at 4:31 PM Russell Spitzer <russell.spit...@gmail.com> wrote:

> That sounds fine to me; we already do the filtering, so populating that
> field would be pretty simple.
>
> On Sun, Sep 27, 2015 at 2:08 PM Michael Armbrust <mich...@databricks.com> wrote:
>
>> We have to try and maintain binary compatibility here, so probably the
>> easiest thing to do would be to add a method to the class. Perhaps
>> something like:
>>
>> def unhandledFilters(filters: Array[Filter]): Array[Filter] = filters
>>
>> By default, this would return all filters, so behavior would remain the
>> same, but specific implementations could override it. There is still a
>> chance that this would conflict with existing methods, but hopefully that
>> would not be a problem in practice.
>>
>> Thoughts?
>>
>> Michael
>>
>> On Fri, Sep 25, 2015 at 10:02 PM, Russell Spitzer <russell.spit...@gmail.com> wrote:
>>
>>> Hi! First-time poster, long-time reader.
>>>
>>> I'm wondering if there is a way to let Catalyst know that it doesn't
>>> need to repeat a filter on the Spark side after the filter has been applied
>>> by a source implementing PrunedFilteredScan.
>>>
>>> This is for a use case in which we accept a filter on a non-existent
>>> column that serves as an entry point for our integration with a different
>>> system. While the source can correctly handle this, the secondary filter
>>> applied to the RDD itself wipes out the results, because the column being
>>> filtered does not exist.
>>>
>>> In particular, this is for our integration with Solr, where we allow
>>> users to pass in a predicate based on "solr_query" (e.g. where
>>> solr_query = '*:*'). There is no column "solr_query", so the
>>> rdd.filter(row.solr_query == "*:*") step filters out all of the data,
>>> since no rows will have that column.
>>>
>>> I'm thinking about a few solutions to this, but they all seem a little
>>> hacky:
>>>
>>> 1) Try to manually remove the filter step from the query plan after our
>>> source handles the filter
>>> 2) Populate the solr_query field in the returned rows so they all
>>> automatically pass
>>>
>>> But I think the real solution is to add a way to create a
>>> PrunedFilteredScan which does not reapply filters if the source doesn't
>>> want it to, i.e. giving PrunedFilteredScan the ability to trust the
>>> underlying source to apply the filter accurately. Maybe changing the API to
>>>
>>> PrunedFilteredScan(requiredColumns: Array[String], filters: Array[Filter],
>>> reapply: Boolean = true)
>>>
>>> where Catalyst can check the reapply value and skip adding an RDD.filter
>>> if it is false.
>>>
>>> Thoughts?
>>>
>>> Thanks for your time,
>>> Russ