[ https://issues.apache.org/jira/browse/SPARK-10978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15000112#comment-15000112 ]

Hyukjin Kwon commented on SPARK-10978:
--------------------------------------

I am sorry to add more comments here, but are you sure that it still pushes 
down filters?

I have tested this many times (including end-to-end tests, function unit tests, 
and debugging with an IDE to inspect the filter objects), but it looks like it 
does not, although maybe I am missing something.

I separately ran the test code in Spark, and none of the tests related to this 
throw exceptions even if the filters are not pushed down or pushdown is 
disabled.

Would you like to try this code under {{ParquetFilterSuite}}?

{code:scala}
  test("actual filter push down") {
    import testImplicits._
    withSQLConf(SQLConf.PARQUET_FILTER_PUSHDOWN_ENABLED.key -> "true") {
      withTempPath { dir =>
        val path = s"${dir.getCanonicalPath}/part=1"
        (1 to 3).map(i => (i, i.toString)).toDF("a", "b").write.parquet(path)
        val df = sqlContext.read.parquet(path).filter("a = 2")

        // This is the source RDD without Spark-side filtering.
        val childRDD =
          df.queryExecution.executedPlan.asInstanceOf[Filter].child.execute()

        // The result should be a single row.
        assert(childRDD.count == 1)
      }
    }
  }
{code}
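
For comparison, here is the inverse check (just a sketch under the same suite assumptions, not an existing test): with pushdown explicitly disabled, the scan below the Spark-side {{Filter}} should hand all three rows up.

{code:scala}
  test("no actual filter push down when disabled") {
    import testImplicits._
    withSQLConf(SQLConf.PARQUET_FILTER_PUSHDOWN_ENABLED.key -> "false") {
      withTempPath { dir =>
        val path = s"${dir.getCanonicalPath}/part=1"
        (1 to 3).map(i => (i, i.toString)).toDF("a", "b").write.parquet(path)
        val df = sqlContext.read.parquet(path).filter("a = 2")

        // With pushdown disabled, the scan feeding the Spark-side Filter
        // should still produce every row in the file.
        val childRDD =
          df.queryExecution.executedPlan.asInstanceOf[Filter].child.execute()

        assert(childRDD.count == 3)
      }
    }
  }
{code}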

> Allow PrunedFilterScan to eliminate predicates from further evaluation
> ----------------------------------------------------------------------
>
>                 Key: SPARK-10978
>                 URL: https://issues.apache.org/jira/browse/SPARK-10978
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 1.3.0, 1.4.0, 1.5.0
>            Reporter: Russell Alexander Spitzer
>            Assignee: Cheng Lian
>            Priority: Critical
>             Fix For: 1.6.0
>
>
> Currently PrunedFilterScan allows implementors to push down predicates to an 
> underlying datasource. This is done solely as an optimization as the 
> predicate will be reapplied on the Spark side as well. This allows for 
> bloom-filter-like operations but ends up doing a redundant scan for those 
> sources which can do accurate pushdowns.
> In addition, it makes it difficult for underlying sources to accept queries 
> which reference non-existent columns in order to provide ancillary 
> functionality. In our case we allow a Solr query to be passed in via a 
> non-existent solr_query column. Since this column is not returned when Spark 
> does a filter on "solr_query", nothing passes. 
> Suggestion on the ML from [~marmbrus] 
> {quote}
> We have to try and maintain binary compatibility here, so probably the 
> easiest thing to do here would be to add a method to the class.  Perhaps 
> something like:
> def unhandledFilters(filters: Array[Filter]): Array[Filter] = filters
> By default, this could return all filters so behavior would remain the same, 
> but specific implementations could override it.  There is still a chance that 
> this would conflict with existing methods, but hopefully that would not be a 
> problem in practice.
> {quote}
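
For context, here is a rough sketch of what a source overriding the suggested hook might look like (everything below is illustrative: the relation name and schema are made up, and it assumes {{unhandledFilters}} is added to {{BaseRelation}} as proposed):

{code:scala}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, EqualTo, Filter, PrunedFilteredScan}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Hypothetical relation that evaluates EqualTo("key", _) exactly in the source
// and tells Spark not to re-apply it, while handing every other filter back.
class KeyValueRelation(override val sqlContext: SQLContext)
  extends BaseRelation with PrunedFilteredScan {

  override def schema: StructType = StructType(Seq(
    StructField("key", IntegerType),
    StructField("value", StringType)))

  // Filters this source can apply accurately on its own.
  private def handledBySource(f: Filter): Boolean = f match {
    case EqualTo("key", _) => true
    case _ => false
  }

  // The proposed hook: return only the filters Spark still has to evaluate.
  override def unhandledFilters(filters: Array[Filter]): Array[Filter] =
    filters.filterNot(handledBySource)

  override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
    // A real implementation would translate the handled filters into the
    // source's own query language here; this sketch just returns nothing.
    sqlContext.sparkContext.emptyRDD[Row]
  }
}
{code}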


