[ https://issues.apache.org/jira/browse/SPARK-10978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14999899#comment-14999899 ]

Yin Huai commented on SPARK-10978:
----------------------------------

I am not quite sure I understand your question correctly, but let me try to 
explain the semantics of {{unhandledFilters}}. If a {{Filter}} is part of 
{{unhandledFilters}}, it does not mean that Spark SQL does not push it down. 
It means that Spark SQL still lets the data source know there is such a 
filter, but Spark SQL does not know whether the data source applies that 
filter to every row. So, even for Parquet and ORC, {{unhandledFilters}} 
returns all {{Filter}}s, and we still push those {{Filter}}s down to Parquet 
and ORC. 

Also, by re-evaluating, I meant the Filter operator that Spark SQL places on 
top of the table scan operator.

Basically, with {{unhandledFilters}} we just want to give the data source a 
chance to let us know that the {{Filter}}s not returned by this method will 
definitely be applied to every row in the data source (and these filters are 
potentially expensive to re-evaluate).
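To make the contract concrete, here is a minimal sketch of a hypothetical 
relation (the class name, schema, and indexed column are made up for 
illustration; this is not code from the actual patch). It handles equality on 
an indexed {{id}} column exactly and returns every other filter for 
Spark-side re-evaluation:

{code:scala}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, EqualTo, Filter, PrunedFilteredScan}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical data source that can evaluate equality on "id" exactly,
// so Spark SQL does not need to re-check those predicates.
class ExampleRelation(override val sqlContext: SQLContext)
  extends BaseRelation with PrunedFilteredScan {

  override def schema: StructType = StructType(
    StructField("id", StringType) ::
    StructField("body", StringType) :: Nil)

  // Filters returned here are re-evaluated by the Filter operator that
  // Spark SQL keeps on top of the scan; filters NOT returned are promised
  // to hold for every row this source produces.
  override def unhandledFilters(filters: Array[Filter]): Array[Filter] =
    filters.filterNot {
      case EqualTo("id", _) => true   // applied exactly by the source
      case _                => false  // everything else: Spark re-checks it
    }

  override def buildScan(
      requiredColumns: Array[String],
      filters: Array[Filter]): RDD[Row] = {
    // All filters are still passed in here, including the ones this
    // relation claimed to handle, so they remain available for pushdown.
    sqlContext.sparkContext.emptyRDD[Row] // stand-in for the real scan
  }
}
{code}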

> Allow PrunedFilterScan to eliminate predicates from further evaluation
> ----------------------------------------------------------------------
>
>                 Key: SPARK-10978
>                 URL: https://issues.apache.org/jira/browse/SPARK-10978
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 1.3.0, 1.4.0, 1.5.0
>            Reporter: Russell Alexander Spitzer
>            Assignee: Cheng Lian
>            Priority: Critical
>             Fix For: 1.6.0
>
>
> Currently PrunedFilteredScan allows implementors to push down predicates to 
> an underlying data source. This is done solely as an optimization, since the 
> predicate will be re-applied on the Spark side as well. This allows for 
> Bloom-filter-like operations but ends up doing a redundant scan for those 
> sources which can do accurate pushdowns.
> In addition, it makes it difficult for underlying sources to accept queries 
> which reference non-existent columns in order to provide ancillary 
> functionality. In our case we allow a Solr query to be passed in via a 
> non-existent solr_query column. Since this column is not returned, when 
> Spark does a filter on "solr_query" nothing passes. 
> Suggestion on the ML from [~marmbrus] 
> {quote}
> We have to try and maintain binary compatibility here, so probably the 
> easiest thing to do here would be to add a method to the class.  Perhaps 
> something like:
> def unhandledFilters(filters: Array[Filter]): Array[Filter] = filters
> By default, this could return all filters so behavior would remain the same, 
> but specific implementations could override it.  There is still a chance that 
> this would conflict with existing methods, but hopefully that would not be a 
> problem in practice.
> {quote}
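To tie the proposed method to the {{solr_query}} case from the description, 
here is a hypothetical sketch (not the actual connector code): the source 
declares the filter on the pass-through column as handled, so Spark SQL's 
Filter operator never evaluates {{solr_query}} against rows that carry no 
value for it:

{code:scala}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, EqualTo, Filter, PrunedFilteredScan}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

class SolrBackedRelation(override val sqlContext: SQLContext)
  extends BaseRelation with PrunedFilteredScan {

  // "solr_query" appears in the schema so queries can reference it, but the
  // source never returns values for it; it only carries the query string
  // down to the underlying store.
  override def schema: StructType = StructType(
    StructField("body", StringType) ::
    StructField("solr_query", StringType) :: Nil)

  // Consume the EqualTo on "solr_query" so Spark SQL does not re-filter on
  // a column that never holds data (which would drop every row).
  override def unhandledFilters(filters: Array[Filter]): Array[Filter] =
    filters.filterNot {
      case EqualTo("solr_query", _) => true
      case _                        => false
    }

  override def buildScan(
      requiredColumns: Array[String],
      filters: Array[Filter]): RDD[Row] =
    sqlContext.sparkContext.emptyRDD[Row] // stand-in for the real scan
}
{code}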


