Re: Dataframes: PrunedFilteredScan without Spark Side Filtering

2015-10-07 Thread Michael Armbrust
Please do.

On Wed, Oct 7, 2015 at 9:49 AM, Russell Spitzer 
wrote:

> Should I open a new ticket for this? Or is there something already
> underway?
>
> On Mon, Oct 5, 2015 at 4:31 PM Russell Spitzer 
> wrote:
>
>> That sounds fine to me; we already do the filtering, so populating that
>> field would be pretty simple.
>>
>> On Sun, Sep 27, 2015 at 2:08 PM Michael Armbrust 
>> wrote:
>>
>>> We have to try to maintain binary compatibility here, so probably the
>>> easiest thing to do would be to add a method to the class. Perhaps
>>> something like:
>>>
>>> def unhandledFilters(filters: Array[Filter]): Array[Filter] = filters
>>>
>>> By default, this would return all filters, so behavior would remain the
>>> same, but specific implementations could override it. There is still a
>>> chance that this would conflict with existing methods, but hopefully that
>>> would not be a problem in practice.
>>>
>>> Thoughts?
>>>
>>> Michael
>>>
>>> On Fri, Sep 25, 2015 at 10:02 PM, Russell Spitzer <
>>> russell.spit...@gmail.com> wrote:
>>>
 Hi! First time poster, long time reader.

I'm wondering if there is a way to let Catalyst know that it doesn't need
 to repeat a filter on the Spark side after the filter has been applied by
 the source implementing PrunedFilteredScan.


 This is for a use case in which we accept a filter on a non-existent
 column that serves as an entry point for our integration with a different
 system. While the source can correctly deal with this, the secondary filter
 applied to the RDD itself wipes out the results because the column being
 filtered does not exist.

 In particular, this is with our integration with Solr, where we allow
 users to pass in a predicate on "solr_query" (e.g. "where
 solr_query = '*:*'"). There is no actual column "solr_query", so the
 rdd.filter(row.solr_query == "*:*") step filters out all of the data, since
 no rows will have that column.

 I'm thinking about a few solutions to this, but they all seem a little
 hacky:
 1) Manually remove the filter step from the query plan after our
 source handles the filter.
 2) Populate the solr_query field in the returned rows so they all
 automatically pass.

 But I think the real solution is to add a way to create a
 PrunedFilteredScan which does not reapply filters if the source doesn't want
 it to, i.e. giving PrunedFilteredScan the ability to trust the underlying
 source to apply the filter accurately. Maybe changing the API to

 PrunedFilteredScan(requiredColumns: Array[String], filters:
 Array[Filter], reapply: Boolean = true)

 where Catalyst can check the reapply value and not add an RDD.filter if
 it is false.

 Thoughts?

 Thanks for your time,
 Russ

>>>
>>>


Re: Dataframes: PrunedFilteredScan without Spark Side Filtering

2015-10-07 Thread Russell Spitzer
Should I open a new ticket for this? Or is there something already
underway?

On Mon, Oct 5, 2015 at 4:31 PM Russell Spitzer 
wrote:

> That sounds fine to me; we already do the filtering, so populating that
> field would be pretty simple.
>
> On Sun, Sep 27, 2015 at 2:08 PM Michael Armbrust 
> wrote:
>
>> We have to try to maintain binary compatibility here, so probably the
>> easiest thing to do would be to add a method to the class. Perhaps
>> something like:
>>
>> def unhandledFilters(filters: Array[Filter]): Array[Filter] = filters
>>
>> By default, this would return all filters, so behavior would remain the
>> same, but specific implementations could override it. There is still a
>> chance that this would conflict with existing methods, but hopefully that
>> would not be a problem in practice.
>>
>> Thoughts?
>>
>> Michael
>>
>> On Fri, Sep 25, 2015 at 10:02 PM, Russell Spitzer <
>> russell.spit...@gmail.com> wrote:
>>
>>> Hi! First time poster, long time reader.
>>>
>>> I'm wondering if there is a way to let Catalyst know that it doesn't
>>> need to repeat a filter on the Spark side after the filter has been applied
>>> by the source implementing PrunedFilteredScan.
>>>
>>>
>>> This is for a use case in which we accept a filter on a non-existent
>>> column that serves as an entry point for our integration with a different
>>> system. While the source can correctly deal with this, the secondary filter
>>> applied to the RDD itself wipes out the results because the column being
>>> filtered does not exist.
>>>
>>> In particular, this is with our integration with Solr, where we allow
>>> users to pass in a predicate on "solr_query" (e.g. "where
>>> solr_query = '*:*'"). There is no actual column "solr_query", so the
>>> rdd.filter(row.solr_query == "*:*") step filters out all of the data, since
>>> no rows will have that column.
>>>
>>> I'm thinking about a few solutions to this, but they all seem a little
>>> hacky:
>>> 1) Manually remove the filter step from the query plan after our
>>> source handles the filter.
>>> 2) Populate the solr_query field in the returned rows so they all
>>> automatically pass.
>>>
>>> But I think the real solution is to add a way to create a
>>> PrunedFilteredScan which does not reapply filters if the source doesn't want
>>> it to, i.e. giving PrunedFilteredScan the ability to trust the underlying
>>> source to apply the filter accurately. Maybe changing the API to
>>>
>>> PrunedFilteredScan(requiredColumns: Array[String], filters: Array[Filter],
>>> reapply: Boolean = true)
>>>
>>> where Catalyst can check the reapply value and not add an RDD.filter if
>>> it is false.
>>>
>>> Thoughts?
>>>
>>> Thanks for your time,
>>> Russ
>>>
>>
>>


Re: Dataframes: PrunedFilteredScan without Spark Side Filtering

2015-10-05 Thread Russell Spitzer
That sounds fine to me; we already do the filtering, so populating that
field would be pretty simple.

On Sun, Sep 27, 2015 at 2:08 PM Michael Armbrust 
wrote:

> We have to try to maintain binary compatibility here, so probably the
> easiest thing to do would be to add a method to the class. Perhaps
> something like:
>
> def unhandledFilters(filters: Array[Filter]): Array[Filter] = filters
>
> By default, this would return all filters, so behavior would remain the
> same, but specific implementations could override it. There is still a
> chance that this would conflict with existing methods, but hopefully that
> would not be a problem in practice.
>
> Thoughts?
>
> Michael
>
> On Fri, Sep 25, 2015 at 10:02 PM, Russell Spitzer <
> russell.spit...@gmail.com> wrote:
>
>> Hi! First time poster, long time reader.
>>
>> I'm wondering if there is a way to let Catalyst know that it doesn't need
>> to repeat a filter on the Spark side after the filter has been applied by the
>> source implementing PrunedFilteredScan.
>>
>>
>> This is for a use case in which we accept a filter on a non-existent
>> column that serves as an entry point for our integration with a different
>> system. While the source can correctly deal with this, the secondary filter
>> applied to the RDD itself wipes out the results because the column being
>> filtered does not exist.
>>
>> In particular, this is with our integration with Solr, where we allow users
>> to pass in a predicate on "solr_query" (e.g. "where solr_query = '*:*'").
>> There is no actual column "solr_query", so the rdd.filter(row.solr_query == "*:*")
>> step filters out all of the data, since no rows will have that column.
>>
>> I'm thinking about a few solutions to this, but they all seem a little
>> hacky:
>> 1) Manually remove the filter step from the query plan after our
>> source handles the filter.
>> 2) Populate the solr_query field in the returned rows so they all
>> automatically pass.
>>
>> But I think the real solution is to add a way to create a
>> PrunedFilteredScan which does not reapply filters if the source doesn't want
>> it to, i.e. giving PrunedFilteredScan the ability to trust the underlying
>> source to apply the filter accurately. Maybe changing the API to
>>
>> PrunedFilteredScan(requiredColumns: Array[String], filters: Array[Filter],
>> reapply: Boolean = true)
>>
>> where Catalyst can check the reapply value and not add an RDD.filter if
>> it is false.
>>
>> Thoughts?
>>
>> Thanks for your time,
>> Russ
>>
>
>


Re: Dataframes: PrunedFilteredScan without Spark Side Filtering

2015-09-27 Thread Michael Armbrust
We have to try to maintain binary compatibility here, so probably the
easiest thing to do would be to add a method to the class. Perhaps
something like:

def unhandledFilters(filters: Array[Filter]): Array[Filter] = filters

By default, this would return all filters, so behavior would remain the
same, but specific implementations could override it. There is still a
chance that this would conflict with existing methods, but hopefully that
would not be a problem in practice.
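
For illustration only, a rough sketch of how a Solr-style source might
override such a method (the object name is made up, and the match on the
synthetic "solr_query" column reflects the use case described below; EqualTo
and Filter are the existing classes in org.apache.spark.sql.sources):

import org.apache.spark.sql.sources.{EqualTo, Filter}

object SolrFilterHandling {
  // Sketch: filters on the synthetic "solr_query" column are fully handled
  // by Solr, so they are not reported back as unhandled and Catalyst would
  // skip re-applying them; everything else is returned for Spark-side
  // re-evaluation.
  def unhandledFilters(filters: Array[Filter]): Array[Filter] =
    filters.filterNot {
      case EqualTo("solr_query", _) => true
      case _                        => false
    }
}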

Thoughts?

Michael

On Fri, Sep 25, 2015 at 10:02 PM, Russell Spitzer  wrote:

> Hi! First time poster, long time reader.
>
> I'm wondering if there is a way to let Catalyst know that it doesn't need
> to repeat a filter on the Spark side after the filter has been applied by the
> source implementing PrunedFilteredScan.
>
>
> This is for a use case in which we accept a filter on a non-existent column
> that serves as an entry point for our integration with a different system.
> While the source can correctly deal with this, the secondary filter applied to
> the RDD itself wipes out the results because the column being filtered does
> not exist.
>
> In particular, this is with our integration with Solr, where we allow users
> to pass in a predicate on "solr_query" (e.g. "where solr_query = '*:*'").
> There is no actual column "solr_query", so the rdd.filter(row.solr_query == "*:*")
> step filters out all of the data, since no rows will have that column.
>
> I'm thinking about a few solutions to this, but they all seem a little hacky:
> 1) Manually remove the filter step from the query plan after our
> source handles the filter.
> 2) Populate the solr_query field in the returned rows so they all
> automatically pass.
>
> But I think the real solution is to add a way to create a PrunedFilteredScan
> which does not reapply filters if the source doesn't want it to, i.e. giving
> PrunedFilteredScan the ability to trust the underlying source to apply the
> filter accurately. Maybe changing the API to
>
> PrunedFilteredScan(requiredColumns: Array[String], filters: Array[Filter],
> reapply: Boolean = true)
>
> where Catalyst can check the reapply value and not add an RDD.filter if it
> is false.
>
> Thoughts?
>
> Thanks for your time,
> Russ
>


Dataframes: PrunedFilteredScan without Spark Side Filtering

2015-09-25 Thread Russell Spitzer
Hi! First time poster, long time reader.

I'm wondering if there is a way to let Catalyst know that it doesn't need
to repeat a filter on the Spark side after the filter has been applied by the
source implementing PrunedFilteredScan.


This is for a use case in which we accept a filter on a non-existent column
that serves as an entry point for our integration with a different system.
While the source can correctly deal with this, the secondary filter applied to
the RDD itself wipes out the results because the column being filtered does
not exist.

In particular, this is with our integration with Solr, where we allow users
to pass in a predicate on "solr_query" (e.g. "where solr_query = '*:*'").
There is no actual column "solr_query", so the rdd.filter(row.solr_query == "*:*")
step filters out all of the data, since no rows will have that column.
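
For reference, this is roughly what the source receives versus what Spark
re-applies (a sketch for the Scala REPL / spark-shell; EqualTo is the Filter
subclass in org.apache.spark.sql.sources that such a predicate arrives as):

import org.apache.spark.sql.sources.{EqualTo, Filter}

// The predicate `where solr_query = '*:*'` is pushed down to the source as:
val pushed: Array[Filter] = Array(EqualTo("solr_query", "*:*"))

// The source turns this into a Solr query and already returns only matching
// rows, but Catalyst re-evaluates the same predicate against those rows;
// since they contain no "solr_query" column, the second filter drops every row.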

I'm thinking about a few solutions to this, but they all seem a little hacky:
1) Manually remove the filter step from the query plan after our
source handles the filter.
2) Populate the solr_query field in the returned rows so they all
automatically pass.

But I think the real solution is to add a way to create a PrunedFilteredScan
which does not reapply filters if the source doesn't want it to, i.e. giving
PrunedFilteredScan the ability to trust the underlying source to apply the
filter accurately. Maybe changing the API to

PrunedFilteredScan(requiredColumns: Array[String], filters: Array[Filter],
reapply: Boolean = true)

where Catalyst can check the reapply value and not add an RDD.filter if it
is false.
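
A rough sketch of what that could look like (hypothetical only; the trait name
and the reapplyFilters member are made up here to illustrate the idea and are
not part of the current API):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.Filter

// Hypothetical variant of the scan trait: a source that overrides
// reapplyFilters to false promises that buildScan already honors every pushed
// filter, so Catalyst would skip adding its own Filter over the scan's output.
trait PrunedFilteredScanWithTrust {
  def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row]
  def reapplyFilters: Boolean = true
}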

Thoughts?

Thanks for your time,
Russ