Re: Dataframes: PrunedFilteredScan without Spark Side Filtering

Russell Spitzer Mon, 05 Oct 2015 16:32:23 -0700

That sounds fine to me. We already do the filtering, so populating that
field would be pretty simple.
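
For illustration, here is a minimal sketch of how a source might override
the unhandledFilters method proposed in the quoted message. It assumes
simplified stand-ins for Filter, EqualTo, GreaterThan, and BaseRelation
rather than Spark's real classes, and SolrBackedRelation and the "age"
column are hypothetical names used only for this example.

```scala
// Simplified stand-ins (assumptions) for the filter types in
// org.apache.spark.sql.sources, so the sketch runs without Spark.
sealed trait Filter
case class EqualTo(attribute: String, value: Any) extends Filter
case class GreaterThan(attribute: String, value: Any) extends Filter

// Hypothetical base class carrying the proposed method and its default.
abstract class BaseRelation {
  // Default: report every filter as unhandled, so all of them are re-applied
  // on the Spark side and existing behavior stays the same.
  def unhandledFilters(filters: Array[Filter]): Array[Filter] = filters
}

// Hypothetical source that fully handles the solr_query pseudo-column itself.
class SolrBackedRelation extends BaseRelation {
  // Drop the solr_query predicate from the list of filters to re-apply;
  // everything else is still reported as unhandled.
  override def unhandledFilters(filters: Array[Filter]): Array[Filter] =
    filters.filterNot {
      case EqualTo("solr_query", _) => true
      case _                        => false
    }
}

object UnhandledFiltersDemo {
  // Filters pushed to the source, and what is left for Spark to re-apply.
  def remaining(): Array[Filter] = {
    val pushed: Array[Filter] =
      Array(EqualTo("solr_query", "*:*"), GreaterThan("age", 21))
    new SolrBackedRelation().unhandledFilters(pushed)
  }
}
```

In this sketch remaining() keeps only the filter on "age", so the
solr_query predicate would no longer be repeated against rows that have no
such column.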


On Sun, Sep 27, 2015 at 2:08 PM Michael Armbrust <mich...@databricks.com>
wrote:

> We have to try and maintain binary compatibility here, so probably the
> easiest thing to do here would be to add a method to the class.  Perhaps
> something like:
>
> def unhandledFilters(filters: Array[Filter]): Array[Filter] = filters
>
> By default, this could return all filters so behavior would remain the
> same, but specific implementations could override it.  There is still a
> chance that this would conflict with existing methods, but hopefully that
> would not be a problem in practice.
>
> Thoughts?
>
> Michael
>
> On Fri, Sep 25, 2015 at 10:02 PM, Russell Spitzer <
> russell.spit...@gmail.com> wrote:
>
>> Hi! First time poster, long time reader.
>>
>> I'm wondering if there is a way to let Catalyst know that it doesn't need
>> to repeat a filter on the Spark side after the filter has been applied by
>> a source implementing PrunedFilteredScan.
>>
>>
>> This is for a use case in which we accept a filter on a nonexistent
>> column that serves as an entry point for our integration with a different
>> system. While the source can correctly deal with this, the secondary filter
>> done on the RDD itself wipes out the results because the column being
>> filtered does not exist.
>>
>> In particular, this is with our integration with Solr, where we allow users
>> to pass in a predicate based on "solr_query", a la "where solr_query='*:*'".
>> There is no column "solr_query", so the rdd.filter(row.solr_query == "*:*")
>> filters out all of the data since no rows will have that column.
>>
>> I'm thinking about a few solutions to this, but they all seem a little
>> hacky:
>> 1) Try to manually remove the filter step from the query plan after our
>> source handles the filter
>> 2) Populate the solr_query field being returned so they all automatically
>> pass
>>
>> But I think the real solution is to add a way to create a
>> PrunedFilteredScan which does not reapply filters if the source doesn't want
>> it to, i.e., giving PrunedFilteredScan the ability to trust the underlying
>> source that the filter will be accurately applied. Maybe changing the API
>> to
>>
>> PrunedFilteredScan(requiredColumns: Array[String], filters: Array[Filter],
>> reapply: Boolean = true)
>>
>> Where Catalyst can check the reapply value and not add an RDD.filter if
>> it is false.
>>
>> Thoughts?
>>
>> Thanks for your time,
>> Russ
>>
>
>
