[ 
https://issues.apache.org/jira/browse/SPARK-24288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16477663#comment-16477663
 ] 

Wenchen Fan commented on SPARK-24288:
-------------------------------------

We need to determine the scope first:
1. add a config to disable predicate pushdown for JDBC data source.
2. have a way to disable operator pushdown for all data sources(using data 
source v2).
3. have a way to disable optimization for a sub query plan.

For 1, it's pretty easy, just define a new option in the JDBC data source.
For 2, we need to revisit data source v2 and think about a standard API to 
disable operator pushdown. This can cover 1 if we migrate JDBC data source to 
ds v2.
For 3, we need to think about the API(both SQL and DataFrame) and the 
interaction with all the optimizer rules. It can cover 1 and 2 if we can make 
it.

Ideally 3 is a more general approach, but I think it would be a big project. 
[~maryannxue] can you estimate how long it will take?



> Enable preventing predicate pushdown
> ------------------------------------
>
>                 Key: SPARK-24288
>                 URL: https://issues.apache.org/jira/browse/SPARK-24288
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Tomasz Gawęda
>            Priority: Major
>
> Issue discussed on Mailing List: 
> [http://apache-spark-developers-list.1001551.n3.nabble.com/Preventing-predicate-pushdown-td23976.html]
> While working with JDBC datasource I saw that many "or" clauses with 
> non-equality operators causes huge performance degradation of SQL query 
> to database (DB2). For example: 
> val df = spark.read.format("jdbc").(other options to parallelize 
> load).load() 
> df.where(s"(date1 > $param1 and (date1 < $param2 or date1 is null) or x 
>  > 100)").show() // in real application whose predicates were pushed 
> many many lines below, many ANDs and ORs 
> If I use cache() before where, there is no predicate pushdown of this 
> "where" clause. However, in production system caching many sources is a 
> waste of memory (especially is pipeline is long and I must do cache many 
> times).There are also few more workarounds, but it would be great if Spark 
> will support preventing predicate pushdown by user.
>  
> For example: df.withAnalysisBarrier().where(...) ?
>  
> Note, that this should not be a global configuration option. If I read 2 
> DataFrames, df1 and df2, I would like to specify that df1 should not have 
> some predicates pushed down, but some may be, but df2 should have all 
> predicates pushed down, even if target query joins df1 and df2. As far as I 
> understand Spark optimizer, if we use functions like `withAnalysisBarrier` 
> and put AnalysisBarrier explicitly in logical plan, then predicates won't be 
> pushed down on this particular DataFrames and PP will be still possible on 
> the second one.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to