The contract of the DataSources API is that filters are advisory and you
are allowed to ignore them
<https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala#L158>.
This is why we always evaluate them ourselves.  Have you benchmarked your
change?  Does it result in a noticeable speedup?
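
To make the contract concrete, here is a minimal sketch (the class name and
body are hypothetical; only the PrunedFilteredScan signature is from the
API). A source receives the pushed-down filters but may apply any subset of
them, which is why Spark re-evaluates every filter on the rows it returns:

  import org.apache.spark.rdd.RDD
  import org.apache.spark.sql.{Row, SQLContext}
  import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}
  import org.apache.spark.sql.types.StructType

  // A relation that legally ignores the advisory filters: it returns rows
  // unfiltered, and Spark applies the predicates again on top of the scan.
  class AdvisoryScan(val sqlContext: SQLContext, val schema: StructType)
      extends BaseRelation with PrunedFilteredScan {
    def buildScan(requiredColumns: Array[String],
                  filters: Array[Filter]): RDD[Row] =
      // `filters` is deliberately unused; the contract allows it.
      sqlContext.sparkContext.parallelize(Seq.empty[Row])
  }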

On Tue, Apr 14, 2015 at 7:29 AM, Yijie Shen <henry.yijies...@gmail.com>
wrote:

> I’ve opened a PR on this: https://github.com/apache/spark/pull/5509
>
> On April 14, 2015 at 11:57:34 AM, Yijie Shen (henry.yijies...@gmail.com)
> wrote:
>
> Hi,
>
> Suppose I have a table t(id: String, event: String) saved as a Parquet
> file, with the directory hierarchy:
> hdfs://path/to/data/root/dt=2015-01-01/hr=00
> After partition discovery, the resulting schema should be (id: String,
> event: String, dt: String, hr: Int)
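>
> For instance (a sketch; the reader call and the printed schema below are
> assumed for a Spark 1.3-era setup):
>
>   // Partition discovery turns the dt=.../hr=... directories into columns.
>   val df = sqlContext.parquetFile("hdfs://path/to/data/root")
>   df.printSchema()
>   // root
>   //  |-- id: string (nullable = true)
>   //  |-- event: string (nullable = true)
>   //  |-- dt: string (nullable = true)
>   //  |-- hr: integer (nullable = true)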
>
> If I have a query like:
>
> df.select($"id").filter(event match).filter($"dt" >
> "2015-01-01").filter($"hr" > 13)
>
> In the current implementation, after (dt > 2015-01-01 && hr > 13) is used
> to filter partitions, these two filters remain in the execution plan, so
> the two fields dt & hr are appended to every row returned from Parquet
> just to re-evaluate predicates that are already satisfied. I think this
> work is unnecessary; we could rewrite execution.Filter's predicate to
> eliminate them.
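>
> A sketch of the rewrite I have in mind (the names `predicates` and
> `partitionColumns` are assumed to be in scope; `references` and `subsetOf`
> are existing Catalyst APIs):
>
>   // Split the conjuncts: predicates over partition columns only are
>   // already enforced by partition pruning and can be dropped.
>   val (handledByPruning, stillNeeded) =
>     predicates.partition(_.references.subsetOf(partitionColumns))
>   // Build execution.Filter from `stillNeeded` alone, so the scan no
>   // longer has to append dt & hr to every row.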
>
> What’s your opinion? Is this a general need, or just a requirement
> specific to my own job?
>
> If it’s a general one, I would love to discuss further about the
> implementations.
> If specific, I would just make my own workaround :)
>
> —
> Best Regards!
> Yijie Shen
>
