Hi,

I have data in both ORC and Parquet formats, and a complex workflow that 
eventually combines the results of several queries over these datasets. I 
would like to get the best execution, and while looking at the default 
configs I noticed the following:

1) Vectorized query execution seems possible with Parquet only; can you 
confirm whether it is also possible with the ORC format?

The parameter is spark.sql.parquet.enableVectorizedReader [1]; on the Hive 
side, vectorization assumes ORC and is controlled by 
hive.vectorized.execution.enabled [2].

[1] https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
[2] https://cwiki.apache.org/confluence/display/Hive/Vectorized+Query+Execution
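For what it's worth, this is how I toggle the Parquet knob today 
(spark-shell, Spark 2.x; the ORC property name at the end is only my guess 
at what an analogue might look like, I could not find one in SQLConf):

  // spark is the SparkSession provided by spark-shell

  // Parquet: vectorized reader, controlled per SQLConf [1]
  spark.conf.set("spark.sql.parquet.enableVectorizedReader", "true")
  println(spark.conf.get("spark.sql.parquet.enableVectorizedReader"))

  // ORC: a hypothetical analogue -- this name is my assumption, not a
  // parameter I found in SQLConf
  // spark.conf.set("spark.sql.orc.enableVectorizedReader", "true")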

2) Filter pushdown is enabled by default for Parquet only; why not also 
for ORC?
spark.sql.parquet.filterPushdown=true
spark.sql.orc.filterPushdown=false
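Right now I flip it manually, e.g. (spark-shell; the path and column are 
made up):

  // Enable ORC predicate pushdown explicitly, since it defaults to false
  spark.conf.set("spark.sql.orc.filterPushdown", "true")

  // With pushdown on, this filter should be evaluated inside the ORC
  // reader instead of after the scan
  val events = spark.read.orc("/data/events_orc")  // hypothetical path
  events.filter(events("event_date") === "2016-01-01").count()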

3) Should I even try to process the ORC format with Spark, given that 
Parquet seems to have native support?
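
To make the question concrete, this is roughly the shape of the workflow 
(spark-shell, Spark 2.x; paths and schema are made up):

  // Read each format with its native reader
  val ordersParquet = spark.read.parquet("/data/orders_parquet") // hypothetical path
  val ordersOrc     = spark.read.orc("/data/orders_orc")         // hypothetical path

  // Assuming identical schemas, combine and aggregate the results
  val combined = ordersParquet.union(ordersOrc)
  combined.groupBy("customer_id").count().show()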


Thank you!

Best,
Ovidiu
