The UDF is a black box, so Spark can't know what it is dealing with. There
are simple cases in which we can analyze the UDF's bytecode and infer what
it is doing, but it is pretty difficult to do in general.
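For illustration, here is a minimal sketch (the Parquet path and schema are
assumed to match the ones in the question below) that contrasts a
Column-expression filter with a UDF-based filter over the same data. Only the
former shows up in PushedFilters, because the UDF body is opaque to the
optimizer; a Column-based predicate keeps being pushed down even on a typed
Dataset:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical Parquet file with schema struct<id:bigint,name:string>.
val cities = spark.read.parquet("cities.parquet")

// Column expression: Catalyst understands the predicate and pushes it to Parquet.
cities.filter(col("name") === "Warsaw").explain()
// ... PushedFilters: [IsNotNull(name), EqualTo(name,Warsaw)]

// UDF: an opaque Scala function, so there is nothing Catalyst can push down.
val isWarsaw = udf((name: String) => name == "Warsaw")
cities.filter(isWarsaw(col("name"))).explain()
// ... PushedFilters: []

// A typed Dataset still benefits from pushdown as long as the predicate is
// expressed with Columns rather than a Scala lambda.
cities.as[(Long, String)].filter(col("name") === "Warsaw").explain()
// ... PushedFilters: [IsNotNull(name), EqualTo(name,Warsaw)]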

On Tuesday, August 30, 2016, Jacek Laskowski <ja...@japila.pl> wrote:

> Hi,
>
> I've been playing with UDFs and why they're a blackbox for Spark's
> optimizer and started with filters to showcase the optimizations in
> play.
>
> My current understanding is that the predicate pushdowns are supported
> by the following data sources:
>
> 1. Hive tables
> 2. Parquet files
> 3. ORC files
> 4. JDBC
>
> While working on examples I came to the conclusion that predicate
> pushdown works for the data sources mentioned above, but only for
> DataFrames. That was quite interesting, since I was so much into
> Datasets as strongly type-safe data abstractions in Spark SQL.
>
> Can you help me find the truth? Any links to videos, articles,
> commits and the like that would further deepen my understanding of
> optimizations in Spark SQL 2.0? I'd greatly appreciate it.
>
> The following query pushes the filter down to Parquet (see
> PushedFilters attribute at the bottom)
>
> scala> cities.filter('name === "Warsaw").queryExecution.executedPlan
> res30: org.apache.spark.sql.execution.SparkPlan =
> *Project [id#196L, name#197]
> +- *Filter (isnotnull(name#197) && (name#197 = Warsaw))
>    +- *FileScan parquet [id#196L,name#197] Batched: true, Format:
> ParquetFormat, InputPaths:
> file:/Users/jacek/dev/oss/spark/cities.parquet, PartitionFilters: [],
> PushedFilters: [IsNotNull(name), EqualTo(name,Warsaw)], ReadSchema:
> struct<id:bigint,name:string>
>
> Why does this not work for Datasets? Is the function/lambda too
> complex? Are there any examples where it works for Datasets? Are we
> perhaps trading strong type-safety for optimizations like predicate
> pushdown (with the features yet to come in the next releases of
> Spark 2)?
>
> scala> cities.as[(Long, String)].filter(_._2 ==
> "Warsaw").queryExecution.executedPlan
> res31: org.apache.spark.sql.execution.SparkPlan =
> *Filter <function1>.apply
> +- *FileScan parquet [id#196L,name#197] Batched: true, Format:
> ParquetFormat, InputPaths:
> file:/Users/jacek/dev/oss/spark/cities.parquet, PartitionFilters: [],
> PushedFilters: [], ReadSchema: struct<id:bigint,name:string>
>
> Pozdrawiam,
> Jacek Laskowski
> ----
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>
