Re: Dataframes: PrunedFilteredScan without Spark Side Filtering
Please do.

On Wed, Oct 7, 2015 at 9:49 AM, Russell Spitzer wrote:
> Should I make up a new ticket for this? Or is there something already underway?
Re: Dataframes: PrunedFilteredScan without Spark Side Filtering
Should I make up a new ticket for this? Or is there something already underway?

On Mon, Oct 5, 2015 at 4:31 PM Russell Spitzer wrote:
> That sounds fine to me, we already do the filtering so populating that field would be pretty simple.
Re: Dataframes: PrunedFilteredScan without Spark Side Filtering
That sounds fine to me, we already do the filtering so populating that field would be pretty simple.

On Sun, Sep 27, 2015 at 2:08 PM Michael Armbrust wrote:
> We have to try and maintain binary compatibility here, so probably the easiest thing to do would be to add a method to the class.
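[For illustration, a minimal sketch of what "populating that field" could look like for a connector that already tracks which predicates it pushed down. The TracksPushedFilters trait and the pushedFilters member are hypothetical names, not part of any Spark or connector API:]

    import org.apache.spark.sql.sources.Filter

    // Hypothetical mixin for a connector that records which predicates it
    // pushed down to the external system during planning.
    trait TracksPushedFilters {
      // Filters the source already applied itself; name is illustrative.
      def pushedFilters: Set[Filter]

      // The proposed method is then just the complement of the pushed-down
      // set: whatever the source did not handle, Spark must still evaluate.
      def unhandledFilters(filters: Array[Filter]): Array[Filter] =
        filters.filterNot(pushedFilters.contains)
    }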
Re: Dataframes: PrunedFilteredScan without Spark Side Filtering
We have to try and maintain binary compatibility here, so probably the easiest thing to do would be to add a method to the class. Perhaps something like:

    def unhandledFilters(filters: Array[Filter]): Array[Filter] = filters

By default, this could return all filters so behavior would remain the same, but specific implementations could override it. There is still a chance that this would conflict with existing methods, but hopefully that would not be a problem in practice.

Thoughts?

Michael

On Fri, Sep 25, 2015 at 10:02 PM, Russell Spitzer <russell.spit...@gmail.com> wrote:
> Hi! First time poster, long time reader. I'm wondering if there is a way to let Catalyst know that it doesn't need to repeat a filter on the Spark side after the filter has been applied by the source implementing PrunedFilteredScan.
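[For concreteness, a sketch of how a source could implement the proposed method. This assumes the method is added to BaseRelation with the default shown above; the SolrRelation class, its schema, and the treatment of the solr_query pseudo-column are illustrative, borrowing from Russ's example further down the thread:]

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.sources.{BaseRelation, EqualTo, Filter, PrunedFilteredScan}
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    // Hypothetical Solr-backed relation; names and schema are illustrative.
    class SolrRelation(val sqlContext: SQLContext)
        extends BaseRelation with PrunedFilteredScan {

      // Expose the pseudo-column so the predicate passes analysis, even
      // though no returned row ever carries a value for it.
      override def schema: StructType = StructType(
        StructField("id", StringType) ::
        StructField("solr_query", StringType) :: Nil)

      // The proposed hook (would carry `override` once the default version
      // lands on BaseRelation): return only the filters Spark must still
      // evaluate itself. This source claims full responsibility for
      // solr_query equality, so Catalyst would skip the post-scan Filter
      // for that predicate.
      def unhandledFilters(filters: Array[Filter]): Array[Filter] =
        filters.filterNot {
          case EqualTo("solr_query", _) => true
          case _ => false
        }

      def buildScan(requiredColumns: Array[String],
                    filters: Array[Filter]): RDD[Row] = {
        // Translate the solr_query predicate into a Solr request here;
        // the actual scan is omitted in this sketch.
        sqlContext.sparkContext.emptyRDD[Row]
      }
    }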
Dataframes: PrunedFilteredScan without Spark Side Filtering
Hi! First time poster, long time reader.

I'm wondering if there is a way to let Catalyst know that it doesn't need to repeat a filter on the Spark side after the filter has been applied by the source implementing PrunedFilteredScan.

This is for a use case in which we accept a filter on a non-existent column that serves as an entry point for our integration with a different system. While the source can correctly deal with this, the secondary filter applied to the RDD itself wipes out the results, because the column being filtered does not exist.

In particular, this is with our integration with Solr, where we allow users to pass in a predicate based on "solr_query", à la

    WHERE solr_query = '*:*'

Since there is no actual "solr_query" column, the re-applied rdd.filter(row.solr_query == "*:*") filters out all of the data, because no rows will ever have that column.

I'm thinking about a few solutions to this, but they all seem a little hacky:
1) Try to manually remove the filter step from the query plan after our source handles the filter
2) Populate the solr_query field being returned so the rows all automatically pass

But I think the real solution is to add a way to create a PrunedFilteredScan which does not reapply filters if the source doesn't want it to, i.e. giving PrunedFilteredScan the ability to trust the underlying source that the filter will be accurately applied. Maybe changing the API to

    PrunedFilteredScan(requiredColumns: Array[String], filters: Array[Filter], reapply: Boolean = true)

where Catalyst can check the reapply value and not add an RDD.filter if it is false.

Thoughts?

Thanks for your time,
Russ
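[To make the failure mode concrete, a hypothetical reproduction. The data source package name is made up, and this assumes the connector exposes solr_query in its schema but never populates it in the rows it returns:]

    import org.apache.spark.sql.SQLContext

    // `sqlContext` is assumed to come from the surrounding application
    // (e.g. a Spark shell session).
    def reproduce(sqlContext: SQLContext): Unit = {
      val df = sqlContext.read
        .format("com.example.solr") // hypothetical data source package
        .load()

      // The source translates solr_query = '*:*' into a Solr request and
      // pushes it down, but Catalyst also re-applies the predicate to the
      // returned rows. Since solr_query is never populated, the post-scan
      // comparison is against null and every row is dropped.
      df.filter("solr_query = '*:*'").show() // returns zero rows
    }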