Please do.

On Wed, Oct 7, 2015 at 9:49 AM, Russell Spitzer <russell.spit...@gmail.com> wrote:
> Should I make up a new ticket for this? Or is there something already
> underway?
>
> On Mon, Oct 5, 2015 at 4:31 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
>
>> That sounds fine to me; we already do the filtering, so populating that
>> field would be pretty simple.
>>
>> On Sun, Sep 27, 2015 at 2:08 PM Michael Armbrust <mich...@databricks.com> wrote:
>>
>>> We have to try to maintain binary compatibility here, so probably the
>>> easiest thing to do would be to add a method to the class. Perhaps
>>> something like:
>>>
>>> def unhandledFilters(filters: Array[Filter]): Array[Filter] = filters
>>>
>>> By default, this could return all filters, so behavior would remain the
>>> same, but specific implementations could override it. There is still a
>>> chance that this would conflict with existing methods, but hopefully that
>>> would not be a problem in practice.
>>>
>>> Thoughts?
>>>
>>> Michael
>>>
>>> On Fri, Sep 25, 2015 at 10:02 PM, Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>
>>>> Hi! First-time poster, long-time reader.
>>>>
>>>> I'm wondering if there is a way to let Catalyst know that it doesn't
>>>> need to repeat a filter on the Spark side after the filter has already
>>>> been applied by a source implementing PrunedFilteredScan.
>>>>
>>>> This is for a use case in which we accept a filter on a non-existent
>>>> column that serves as an entry point for our integration with a different
>>>> system. While the source can correctly handle this, the secondary filter
>>>> applied to the RDD itself wipes out the results, because the column being
>>>> filtered on does not exist.
>>>>
>>>> In particular, this is for our integration with Solr, where we allow
>>>> users to pass in a predicate based on "solr_query" (e.g. where
>>>> solr_query = '*:*'). There is no column "solr_query", so the
>>>> rdd.filter(row.solr_query == "*:*") filters out all of the data, since
>>>> no rows will have that column.
>>>>
>>>> I'm thinking about a few solutions to this, but they all seem a little
>>>> hacky:
>>>> 1) Try to manually remove the filter step from the query plan after our
>>>> source handles the filter.
>>>> 2) Populate the solr_query field being returned so the rows all
>>>> automatically pass.
>>>>
>>>> But I think the real solution is to add a way to create a
>>>> PrunedFilteredScan which does not reapply filters if the source doesn't
>>>> want it to, i.e. giving PrunedFilteredScan the ability to trust the
>>>> underlying source that the filter will be accurately applied. Maybe
>>>> changing the API to
>>>>
>>>> PrunedFilteredScan(requiredColumns: Array[String], filters: Array[Filter], reapply: Boolean = true)
>>>>
>>>> where Catalyst can check the reapply value and not add an RDD.filter if
>>>> it is false.
>>>>
>>>> Thoughts?
>>>>
>>>> Thanks for your time,
>>>> Russ
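[Editor's note] A minimal sketch of how Michael's proposed `unhandledFilters` hook could work. The `Filter`, `EqualTo`, `FilteredScan`, and `SolrLikeScan` names here are hypothetical stand-ins, not the real Spark sources API; the point is only to show the default-passthrough contract and how a source like the Solr integration could claim the synthetic solr_query predicate:

```scala
// Stand-in filter types modeling the shape of the sources API.
sealed trait Filter
case class EqualTo(attribute: String, value: Any) extends Filter

// The proposed hook: by default every filter is reported as unhandled,
// so Catalyst re-applies all of them and existing sources are unaffected.
trait FilteredScan {
  def unhandledFilters(filters: Array[Filter]): Array[Filter] = filters
}

// A Solr-like source claims the synthetic solr_query predicate, so
// Catalyst would not re-filter rows on a column that does not exist.
object SolrLikeScan extends FilteredScan {
  override def unhandledFilters(filters: Array[Filter]): Array[Filter] =
    filters.filterNot {
      case EqualTo("solr_query", _) => true
      case _                        => false
    }
}

val pushed = Array[Filter](EqualTo("solr_query", "*:*"), EqualTo("id", 1))
// Catalyst would only re-apply what the source reports back as unhandled;
// here the solr_query predicate is dropped and only the id filter remains.
val reapplied = SolrLikeScan.unhandledFilters(pushed)
```

Because the default implementation returns its input unchanged, adding the method to an existing trait preserves the current behavior for every source that does not override it, which is what makes it binary compatible.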