Hi,

I've been playing with UDFs and why they're a black box for Spark's optimizer, and started with filters to showcase the optimizations in play.
My current understanding is that predicate pushdown is supported by the following data sources:

1. Hive tables
2. Parquet files
3. ORC files
4. JDBC

While working on examples I came to the conclusion that predicate pushdown not only works for the data sources mentioned above, but works only for DataFrames. That was quite interesting, since I was so much into Datasets as strongly type-safe data abstractions in Spark SQL. Can you help me find the truth? I'd greatly appreciate any links to videos, articles, commits and such that would further deepen my understanding of optimizations in Spark SQL 2.0.

The following query pushes the filter down to Parquet (see the PushedFilters attribute at the bottom):

scala> cities.filter('name === "Warsaw").queryExecution.executedPlan
res30: org.apache.spark.sql.execution.SparkPlan =
*Project [id#196L, name#197]
+- *Filter (isnotnull(name#197) && (name#197 = Warsaw))
   +- *FileScan parquet [id#196L,name#197] Batched: true, Format: ParquetFormat, InputPaths: file:/Users/jacek/dev/oss/spark/cities.parquet, PartitionFilters: [], PushedFilters: [IsNotNull(name), EqualTo(name,Warsaw)], ReadSchema: struct<id:bigint,name:string>

Why does this not work for Datasets? Is the function/lambda too complex? Are there any examples where it works for Datasets? Are we perhaps trading strong type-safety for optimizations like predicate pushdown (with such features yet to come in later releases of Spark 2)?
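For context, here is my mental model of why a Column-based predicate can be pushed down while a plain Scala function cannot. This is a toy sketch of the idea, not Spark's actual Catalyst classes: a declarative predicate is a data structure the planner can pattern-match on and translate into a source-level filter, whereas a lambda is an opaque compiled function that can only be invoked row by row after the scan.

```scala
// Toy expression "AST", loosely analogous to Catalyst's Expression nodes
// (hypothetical names for illustration, not Spark's API).
sealed trait Expr
case class Attr(name: String)               extends Expr
case class Lit(value: String)               extends Expr
case class EqualTo(left: Expr, right: Expr) extends Expr

// A hypothetical "planner" step: translate an expression into a
// source-level pushed filter if its shape is recognised, otherwise give up.
def pushDown(predicate: Expr): Option[String] = predicate match {
  case EqualTo(Attr(col), Lit(v)) => Some(s"EqualTo($col,$v)")
  case _                          => None
}

// The declarative form is inspectable...
val declarative = EqualTo(Attr("name"), Lit("Warsaw"))
println(pushDown(declarative))

// ...but a lambda is just a function value; all a planner could do is
// call it per row, after the scan has already read the data.
val opaque: ((Long, String)) => Boolean = _._2 == "Warsaw"
println(opaque((1L, "Warsaw")))
```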
scala> cities.as[(Long, String)].filter(_._2 == "Warsaw").queryExecution.executedPlan
res31: org.apache.spark.sql.execution.SparkPlan =
*Filter <function1>.apply
+- *FileScan parquet [id#196L,name#197] Batched: true, Format: ParquetFormat, InputPaths: file:/Users/jacek/dev/oss/spark/cities.parquet, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint,name:string>

Regards,
Jacek Laskowski
----
https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski