[ https://issues.apache.org/jira/browse/SPARK-24288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16477663#comment-16477663 ]
Wenchen Fan commented on SPARK-24288: ------------------------------------- We need to determine the scope first: 1. add a config to disable predicate pushdown for JDBC data source. 2. have a way to disable operator pushdown for all data sources(using data source v2). 3. have a way to disable optimization for a sub query plan. For 1, it's pretty easy, just define a new option in the JDBC data source. For 2, we need to revisit data source v2 and think about a standard API to disable operator pushdown. This can cover 1 if we migrate JDBC data source to ds v2. For 3, we need to think about the API(both SQL and DataFrame) and the interaction with all the optimizer rules. It can cover 1 and 2 if we can make it. Ideally 3 is a more general approach, but I think it would be a big project. [~maryannxue] can you estimate how long it will take? > Enable preventing predicate pushdown > ------------------------------------ > > Key: SPARK-24288 > URL: https://issues.apache.org/jira/browse/SPARK-24288 > Project: Spark > Issue Type: New Feature > Components: SQL > Affects Versions: 2.3.0 > Reporter: Tomasz Gawęda > Priority: Major > > Issue discussed on Mailing List: > [http://apache-spark-developers-list.1001551.n3.nabble.com/Preventing-predicate-pushdown-td23976.html] > While working with JDBC datasource I saw that many "or" clauses with > non-equality operators causes huge performance degradation of SQL query > to database (DB2). For example: > val df = spark.read.format("jdbc").(other options to parallelize > load).load() > df.where(s"(date1 > $param1 and (date1 < $param2 or date1 is null) or x > > 100)").show() // in real application whose predicates were pushed > many many lines below, many ANDs and ORs > If I use cache() before where, there is no predicate pushdown of this > "where" clause. However, in production system caching many sources is a > waste of memory (especially is pipeline is long and I must do cache many > times).There are also few more workarounds, but it would be great if Spark > will support preventing predicate pushdown by user. > > For example: df.withAnalysisBarrier().where(...) ? > > Note, that this should not be a global configuration option. If I read 2 > DataFrames, df1 and df2, I would like to specify that df1 should not have > some predicates pushed down, but some may be, but df2 should have all > predicates pushed down, even if target query joins df1 and df2. As far as I > understand Spark optimizer, if we use functions like `withAnalysisBarrier` > and put AnalysisBarrier explicitly in logical plan, then predicates won't be > pushed down on this particular DataFrames and PP will be still possible on > the second one. -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org