Should I open a new ticket for this, or is there something already underway?
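For concreteness, here is a minimal self-contained sketch of the unhandledFilters hook Michael proposes below. The Filter, EqualTo, and BaseRelation types here are simplified stand-ins for the real org.apache.spark.sql.sources classes, and SolrRelation is hypothetical; this is just a sketch of how a source could claim a synthetic predicate, not the actual Spark or connector implementation:

```scala
// Simplified stand-ins for org.apache.spark.sql.sources.Filter and friends.
sealed trait Filter
case class EqualTo(attribute: String, value: Any) extends Filter

trait BaseRelation {
  // Default: report every filter as unhandled, so Catalyst keeps
  // re-applying them and existing sources keep their current behavior.
  def unhandledFilters(filters: Array[Filter]): Array[Filter] = filters
}

// A hypothetical Solr-backed source claims the synthetic "solr_query"
// predicate so Catalyst would not re-filter on the non-existent column.
class SolrRelation extends BaseRelation {
  override def unhandledFilters(filters: Array[Filter]): Array[Filter] =
    filters.filterNot {
      case EqualTo("solr_query", _) => true
      case _                        => false
    }
}

val pushed: Array[Filter] =
  Array(EqualTo("solr_query", "*:*"), EqualTo("name", "russ"))

// Only the "name" filter is reported back as unhandled.
println(new SolrRelation().unhandledFilters(pushed).mkString(", "))
```

Because the default implementation returns its input unchanged, existing sources compile and behave exactly as before, which is what makes the method binary-compatibility-friendly.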
On Mon, Oct 5, 2015 at 4:31 PM Russell Spitzer <russell.spit...@gmail.com> wrote:

> That sounds fine to me; we already do the filtering, so populating that
> field would be pretty simple.
>
> On Sun, Sep 27, 2015 at 2:08 PM Michael Armbrust <mich...@databricks.com> wrote:
>
>> We have to try and maintain binary compatibility here, so probably the
>> easiest thing to do would be to add a method to the class. Perhaps
>> something like:
>>
>> def unhandledFilters(filters: Array[Filter]): Array[Filter] = filters
>>
>> By default, this would return all filters, so behavior would remain the
>> same, but specific implementations could override it. There is still a
>> chance that this would conflict with existing methods, but hopefully that
>> would not be a problem in practice.
>>
>> Thoughts?
>>
>> Michael
>>
>> On Fri, Sep 25, 2015 at 10:02 PM, Russell Spitzer <russell.spit...@gmail.com> wrote:
>>
>>> Hi! First-time poster, long-time reader.
>>>
>>> I'm wondering if there is a way to let Catalyst know that it doesn't
>>> need to repeat a filter on the Spark side after the filter has been applied
>>> by a source implementing PrunedFilteredScan.
>>>
>>> This is for a use case in which we accept a filter on a non-existent
>>> column that serves as an entry point for our integration with a different
>>> system. While the source can correctly handle this, the secondary filter
>>> applied to the RDD itself wipes out the results, because the column being
>>> filtered does not exist.
>>>
>>> In particular, this is for our integration with Solr, where we allow
>>> users to pass in a predicate based on "solr_query" (e.g. where
>>> solr_query = '*:*'). There is no column "solr_query", so the
>>> rdd.filter(row.solr_query == "*:*") step filters out all of the data,
>>> since no rows will have that column.
>>>
>>> I'm thinking about a few solutions to this, but they all seem a little
>>> hacky:
>>>
>>> 1) Try to manually remove the filter step from the query plan after our
>>> source handles the filter
>>> 2) Populate the solr_query field in the returned rows so they all
>>> automatically pass
>>>
>>> But I think the real solution is to add a way to create a
>>> PrunedFilteredScan which does not reapply filters if the source doesn't
>>> want it to, i.e. giving PrunedFilteredScan the ability to trust the
>>> underlying source to apply the filter accurately. Maybe changing the API to
>>>
>>> PrunedFilteredScan(requiredColumns: Array[String], filters: Array[Filter],
>>> reapply: Boolean = true)
>>>
>>> where Catalyst can check the reapply value and skip adding an RDD.filter
>>> if it is false.
>>>
>>> Thoughts?
>>>
>>> Thanks for your time,
>>> Russ