Hi,

I've been playing with UDFs and why they're a black box for Spark's optimizer, and started with filters to showcase the optimizations in play.
My current understanding is that predicate pushdown is supported by the following data sources:

1. Hive tables
2. Parquet files
3. ORC files
4. JDBC

While working on examples I came to the conclusion that predicate pushdown not only works for the data sources mentioned above, but works only for DataFrames. That was quite interesting, since I was so much into Datasets as strongly type-safe data abstractions in Spark SQL. Can you help me find the truth? I'd greatly appreciate any links to videos, articles, commits and such that would further deepen my understanding of optimizations in Spark SQL 2.0.

The following query pushes the filter down to Parquet (see the PushedFilters attribute at the bottom):

scala> cities.filter('name === "Warsaw").queryExecution.executedPlan
res30: org.apache.spark.sql.execution.SparkPlan =
*Project [id#196L, name#197]
+- *Filter (isnotnull(name#197) && (name#197 = Warsaw))
   +- *FileScan parquet [id#196L,name#197] Batched: true, Format: ParquetFormat, InputPaths: file:/Users/jacek/dev/oss/spark/cities.parquet, PartitionFilters: [], PushedFilters: [IsNotNull(name), EqualTo(name,Warsaw)], ReadSchema: struct<id:bigint,name:string>

Why does this not work for Datasets? Is the function/lambda too complex? Are there any examples where it works for Datasets? Are we perhaps trading strong type-safety for optimizations like predicate pushdown (with such features yet to come in later releases of Spark 2)?
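For context, here is my mental model of why a Column-based predicate can be pushed down while a plain Scala function cannot. This is a toy sketch of the idea, not Spark's actual Catalyst classes: a declarative predicate is a data structure the planner can pattern-match on and translate into a source-level filter, whereas a lambda is an opaque compiled function that can only be invoked row by row after the scan.

```scala
// Toy expression "AST", loosely analogous to Catalyst's Expression nodes
// (hypothetical names for illustration, not Spark's API).
sealed trait Expr
case class Attr(name: String)               extends Expr
case class Lit(value: String)               extends Expr
case class EqualTo(left: Expr, right: Expr) extends Expr

// A hypothetical "planner" step: translate an expression into a
// source-level pushed filter if its shape is recognised, otherwise give up.
def pushDown(predicate: Expr): Option[String] = predicate match {
  case EqualTo(Attr(col), Lit(v)) => Some(s"EqualTo($col,$v)")
  case _                          => None
}

// The declarative form is inspectable...
val declarative = EqualTo(Attr("name"), Lit("Warsaw"))
println(pushDown(declarative))

// ...but a lambda is just a function value; all a planner could do is
// call it per row, after the scan has already read the data.
val opaque: ((Long, String)) => Boolean = _._2 == "Warsaw"
println(opaque((1L, "Warsaw")))
```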
scala> cities.as[(Long, String)].filter(_._2 == "Warsaw").queryExecution.executedPlan
res31: org.apache.spark.sql.execution.SparkPlan =
*Filter <function1>.apply
+- *FileScan parquet [id#196L,name#197] Batched: true, Format: ParquetFormat, InputPaths: file:/Users/jacek/dev/oss/spark/cities.parquet, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint,name:string>

Regards,
Jacek Laskowski
----
https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski