The contract of the DataSources API is that filters are advisory and a source is allowed to ignore them <https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala#L158>. This is why we always re-evaluate them ourselves. Have you benchmarked your change? Does it result in a noticeable speedup?
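For reference, the hook in question is PrunedFilteredScan. A minimal sketch of a relation that legally ignores every pushed-down filter (the class name and sample data here are hypothetical, not anything in Spark) would be:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}
    import org.apache.spark.sql.types._

    // Hypothetical relation that honors column pruning but ignores all
    // pushed-down filters, which the API contract permits.
    class IgnoresFiltersRelation(val sqlContext: SQLContext)
      extends BaseRelation with PrunedFilteredScan {

      override def schema: StructType =
        StructType(Seq(
          StructField("id", StringType),
          StructField("event", StringType)))

      override def buildScan(
          requiredColumns: Array[String],
          filters: Array[Filter]): RDD[Row] = {
        // `filters` is advisory: a source may apply all, some, or none of
        // them. This one applies none, so Spark's own Filter operator must
        // re-check every predicate on the rows we return.
        val data = Seq(
          Map("id" -> "a", "event" -> "click"),
          Map("id" -> "b", "event" -> "view"))
        sqlContext.sparkContext.parallelize(
          data.map(row => Row.fromSeq(requiredColumns.map(row))))
      }
    }

Because Spark cannot know which of the pushed-down filters a source actually honored, the planner keeps a Filter operator above the scan and evaluates every predicate again.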
On Tue, Apr 14, 2015 at 7:29 AM, Yijie Shen <henry.yijies...@gmail.com> wrote:
> I've opened a PR on this: https://github.com/apache/spark/pull/5509
>
> On April 14, 2015 at 11:57:34 AM, Yijie Shen (henry.yijies...@gmail.com) wrote:
>
> Hi,
>
> Suppose I have a table t(id: String, event: String) saved as a Parquet file,
> with the directory hierarchy:
> hdfs://path/to/data/root/dt=2015-01-01/hr=00
> After partition discovery, the resulting schema should be
> (id: String, event: String, dt: String, hr: Int).
>
> If I have a query like:
>
> df.select($"id").filter(event match).filter($"dt" > "2015-01-01").filter($"hr" > 13)
>
> In the current implementation, after (dt > "2015-01-01" && hr > 13) is used
> to prune partitions, these two filters remain in the execution plan, so each
> row returned from Parquet carries two extra fields, dt and hr, every time.
> I think this is useless; we could rewrite execution.Filter's predicate and
> eliminate them.
>
> What's your opinion? Is this a general assumption, or is it just my job's
> specific requirement?
>
> If it's general, I would love to discuss the implementation further.
> If it's specific, I'll just make my own workaround :)
>
> --
> Best Regards!
> Yijie Shen
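For what it's worth, the split Yijie describes could be expressed over Catalyst expressions roughly like this (a sketch only, assuming the planner already knows the partition columns; the helper name is hypothetical):

    import org.apache.spark.sql.catalyst.expressions.{AttributeSet, Expression}

    // Hypothetical helper: separate predicates that reference only partition
    // columns (already fully decided by partition pruning) from those that
    // must still run against rows read out of Parquet.
    def splitPartitionPredicates(
        predicates: Seq[Expression],
        partitionColumns: AttributeSet): (Seq[Expression], Seq[Expression]) =
      predicates.partition(_.references.subsetOf(partitionColumns))

Only the second group would need to stay in execution.Filter; the first group is answered entirely by the partition values and could be dropped once pruning has run.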