applying predict pushdown is an optimization, and it makes sense to provide configs to turn off certain optimizations. Feel free to create a JIRA.
Thanks, Wenchen On Tue, May 15, 2018 at 8:33 PM, Tomasz Gawęda <tomasz.gaw...@outlook.com> wrote: > Hi, > > while working with JDBC datasource I saw that many "or" clauses with > non-equality operators causes huge performance degradation of SQL query > to database (DB2). For example: > > val df = spark.read.format("jdbc").(other options to parallelize > load).load() > > df.where(s"(date1 > $param1 and (date1 < $param2 or date1 is null) or x > > 100)").show() // in real application whose predicates were pushed > many many lines below, many ANDs and ORs > > If I use cache() before where, there is no predicate pushdown of this > "where" clause. However, in production system caching many sources is a > waste of memory (especially is pipeline is long and I must do cache many > times). > > > I asked on StackOverflow for better ideas: > https://stackoverflow.com/questions/50336355/how-to- > prevent-predicate-pushdown > > However, there are only workarounds. I can use those workarounds, but > maybe it would be better to add such functionality directly in the API? > > For example: df.withAnalysisBarrier().where(...) ? > > Please let me know if I should create a JIRA or it's not a good idea for > some reasons. > > > Pozdrawiam / Best regards, > > Tomek Gawęda > >