Please do.

On Wed, Oct 7, 2015 at 9:49 AM, Russell Spitzer <russell.spit...@gmail.com> wrote:
> Should I make up a new ticket for this? Or is there something already
> underway?
>
> On Mon, Oct 5, 2015 at 4:31 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
>
>> That sounds fine to me; we already do the filtering, so populating that
>> field would be pretty simple.
>>
>> On Sun, Sep 27, 2015 at 2:08 PM Michael Armbrust <mich...@databricks.com> wrote:
>>
>>> We have to try to maintain binary compatibility here, so probably the
>>> easiest thing to do would be to add a method to the class. Perhaps
>>> something like:
>>>
>>> def unhandledFilters(filters: Array[Filter]): Array[Filter] = filters
>>>
>>> By default, this could return all filters, so behavior would remain the
>>> same, but specific implementations could override it. There is still a
>>> chance that this would conflict with existing methods, but hopefully that
>>> would not be a problem in practice.
>>>
>>> Thoughts?
>>>
>>> Michael
>>>
>>> On Fri, Sep 25, 2015 at 10:02 PM, Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>
>>>> Hi! First-time poster, long-time reader.
>>>>
>>>> I'm wondering if there is a way to let Catalyst know that it doesn't
>>>> need to repeat a filter on the Spark side after the filter has already
>>>> been applied by a source implementing PrunedFilteredScan.
>>>>
>>>> This is for a use case in which we accept a filter on a non-existent
>>>> column that serves as an entry point for our integration with a different
>>>> system. While the source can correctly handle this, the secondary filter
>>>> applied to the RDD itself wipes out the results, because the column being
>>>> filtered on does not exist.
>>>>
>>>> In particular, this is for our integration with Solr, where we allow
>>>> users to pass in a predicate based on "solr_query" (e.g. where
>>>> solr_query = '*:*'). There is no column "solr_query", so the
>>>> rdd.filter(row.solr_query == "*:*") filters out all of the data, since
>>>> no rows will have that column.
>>>>
>>>> I'm thinking about a few solutions to this, but they all seem a little
>>>> hacky:
>>>> 1) Try to manually remove the filter step from the query plan after our
>>>> source handles the filter.
>>>> 2) Populate the solr_query field being returned so the rows all
>>>> automatically pass.
>>>>
>>>> But I think the real solution is to add a way to create a
>>>> PrunedFilteredScan which does not reapply filters if the source doesn't
>>>> want it to, i.e. giving PrunedFilteredScan the ability to trust the
>>>> underlying source that the filter will be accurately applied. Maybe
>>>> changing the API to
>>>>
>>>> PrunedFilteredScan(requiredColumns: Array[String], filters: Array[Filter], reapply: Boolean = true)
>>>>
>>>> where Catalyst can check the reapply value and not add an RDD.filter if
>>>> it is false.
>>>>
>>>> Thoughts?
>>>>
>>>> Thanks for your time,
>>>> Russ
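[Editor's note] A minimal sketch of how Michael's proposed `unhandledFilters` hook could work. The `Filter`, `EqualTo`, `FilteredScan`, and `SolrLikeScan` names here are hypothetical stand-ins, not the real Spark sources API; the point is only to show the default-passthrough contract and how a source like the Solr integration could claim the synthetic solr_query predicate:

```scala
// Stand-in filter types modeling the shape of the sources API.
sealed trait Filter
case class EqualTo(attribute: String, value: Any) extends Filter

// The proposed hook: by default every filter is reported as unhandled,
// so Catalyst re-applies all of them and existing sources are unaffected.
trait FilteredScan {
  def unhandledFilters(filters: Array[Filter]): Array[Filter] = filters
}

// A Solr-like source claims the synthetic solr_query predicate, so
// Catalyst would not re-filter rows on a column that does not exist.
object SolrLikeScan extends FilteredScan {
  override def unhandledFilters(filters: Array[Filter]): Array[Filter] =
    filters.filterNot {
      case EqualTo("solr_query", _) => true
      case _                        => false
    }
}

val pushed = Array[Filter](EqualTo("solr_query", "*:*"), EqualTo("id", 1))
// Catalyst would only re-apply what the source reports back as unhandled;
// here the solr_query predicate is dropped and only the id filter remains.
val reapplied = SolrLikeScan.unhandledFilters(pushed)
```

Because the default implementation returns its input unchanged, adding the method to an existing trait preserves the current behavior for every source that does not override it, which is what makes it binary compatible.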