Hi, Nicolas.

Yes, both are affected. Apache Spark 2.3 includes several new sub-improvements under SPARK-20901 (Feature parity for ORC with Parquet). The following three settings are relevant to your question.

1. spark.sql.orc.impl="native"
   By default, the `native` ORC implementation (based on the latest ORC 1.4.1) is used. The old one is the `hive` implementation.

2. spark.sql.orc.enableVectorizedReader="true"
   By default, the `native` ORC implementation uses the vectorized reader code path if possible. Please note that vectorization (Parquet/ORC) in Apache Spark is supported only for simple data types.

3. spark.sql.hive.convertMetastoreOrc=true
   Like Parquet, by default, Hive tables are converted into file-based data sources so that they can use vectorization.

Bests,
Dongjoon.

On Sun, Jan 28, 2018 at 4:15 AM, Nicolas Paris <nipari...@gmail.com> wrote:

> Hi
>
> Thanks for this work.
>
> Will this affect both:
> 1) spark.read.format("orc").load("...")
> 2) spark.sql("select ... from my_orc_table_in_hive")
>
> ?
>
> On 10 Jan. 2018 at 20:14, Dongjoon Hyun wrote:
> > Hi, All.
> >
> > Vectorized ORC Reader is now supported in Apache Spark 2.3.
> >
> > https://issues.apache.org/jira/browse/SPARK-16060
> >
> > It has been a long journey. From now, Spark can read ORC files faster
> > without feature penalty.
> >
> > Thank you for all your support, especially Wenchen Fan.
> >
> > It's done by two commits.
> >
> > [SPARK-16060][SQL] Support Vectorized ORC Reader
> > https://github.com/apache/spark/commit/f44ba910f58083458e1133502e193a9d6f2bf766
> >
> > [SPARK-16060][SQL][FOLLOW-UP] add a wrapper solution for vectorized orc reader
> > https://github.com/apache/spark/commit/eaac60a1e20e29084b7151ffca964cfaa5ba99d1
> >
> > Please check OrcReadBenchmark for the final speed-up from `Hive built-in ORC`
> > to `Native ORC Vectorized`.
> >
> > https://github.com/apache/spark/blob/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala
> >
> > Thank you.
> >
> > Bests,
> > Dongjoon.
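
As a rough sketch of how the three settings named in the reply could be applied explicitly rather than relying on a build's defaults, something like the following might work. The names `ORC_SETTINGS`, `to_conf_args`, and `build_session` are hypothetical helpers introduced only for illustration, and the app name is a placeholder; only the three configuration keys come from the thread.

```python
# The three settings named in the reply. They are set explicitly so the
# sketch does not depend on the defaults of a particular Spark build.
ORC_SETTINGS = {
    "spark.sql.orc.impl": "native",
    "spark.sql.orc.enableVectorizedReader": "true",
    "spark.sql.hive.convertMetastoreOrc": "true",
}


def to_conf_args(settings=ORC_SETTINGS):
    """Render the settings as --conf flags for spark-submit or spark-shell."""
    args = []
    for key, value in settings.items():
        args += ["--conf", f"{key}={value}"]
    return args


def build_session(app_name="orc-vectorized-demo"):
    """Create a Hive-enabled SparkSession with the ORC settings applied.

    pyspark is imported lazily so the helpers above work without Spark
    installed. The app name is an arbitrary placeholder.
    """
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name).enableHiveSupport()
    for key, value in ORC_SETTINGS.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

With `spark = build_session()`, both access paths from the question, `spark.read.format("orc").load("...")` and `spark.sql("select ... from my_orc_table_in_hive")`, would be issued against a session carrying these settings. Whether a given query actually takes the vectorized path still depends on the column types involved, as noted in the reply.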