Hi,

Assuming I have some data in both ORC and Parquet formats, and some complex workflow that eventually combines the results of several queries on these datasets, I would like to get the best execution. Looking at the default configs, I noticed:
1) Vectorized query execution seems possible with Parquet only; can you confirm whether this is also possible with the ORC format?
   Parameter: spark.sql.parquet.enableVectorizedReader
   [1] https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

   Hive, on the other hand, assumes ORC; parameter hive.vectorized.execution.enabled
   [2] https://cwiki.apache.org/confluence/display/Hive/Vectorized+Query+Execution

2) Filter pushdown is enabled by default for Parquet only; why not also for ORC?
   spark.sql.parquet.filterPushdown=true
   spark.sql.orc.filterPushdown=false

3) Should I even try to process the ORC format with Spark at all, given that Parquet seems to have native support?

Thank you!

Best,
Ovidiu
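P.S. For reference, this is a sketch of the settings I am comparing, written as spark-defaults.conf properties (property names taken from SQLConf; the Parquet values shown are already the defaults, only the ORC line changes a default):

```
# Parquet: vectorized reader and filter pushdown (these are the defaults)
spark.sql.parquet.enableVectorizedReader  true
spark.sql.parquet.filterPushdown          true

# ORC: filter pushdown is false by default; enabling it explicitly here
spark.sql.orc.filterPushdown              true
```

The same properties can also be set per session, e.g. via spark-submit --conf or on the SparkSession builder.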