Hi, I have been wondering how much more Apache Spark 2.2.0 will improve.
This is the prior record from the source code:

Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz

SQL Single Int Column Scan:         Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
SQL Parquet Vectorized                    215 /  262          73.0          13.7       1.0X
SQL Parquet MR                           1946 / 2083           8.1         123.7       0.1X

So, I got a similar (but slower) machine and ran ParquetReadBenchmark on it. Apache Spark seems to have improved a lot again. But strangely, the MR version has improved even more in general:

Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Mac OS X 10.12.4
Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz

SQL Single Int Column Scan:         Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
SQL Parquet Vectorized                    102 /  123         153.7           6.5       1.0X
SQL Parquet MR                            409 /  436          38.5          26.0       0.3X

For ORC, my PR (https://github.com/apache/spark/pull/17924) looks like the following:

Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Mac OS X 10.12.4
Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz

SQL Single Int Column Scan:         Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
SQL ORC Vectorized                        147 /  153         107.3           9.3       1.0X
SQL ORC MR                                338 /  369          46.5          21.5       0.4X
HIVE ORC MR                               408 /  424          38.6          25.9       0.4X

Given that this is an initial PR without optimization, ORC vectorization seems to be catching up quickly.

Bests,
Dongjoon.

From: Dongjoon Hyun <dh...@hortonworks.com>
Date: Tuesday, May 9, 2017 at 6:15 PM
To: "dev@spark.apache.org" <dev@spark.apache.org>
Subject: Faster Spark on ORC with Apache ORC

Hi, All.

Apache Spark has always been a fast and general engine, and since SPARK-2883, Spark has supported Apache ORC inside the `sql/hive` module with a Hive dependency. With Apache ORC 1.4.0 (released yesterday), we can make Spark on ORC faster and gain several benefits:

- Speed: Use Spark's `ColumnarBatch` and ORC's `RowBatch` together, which means full vectorization support.
- Stability: Apache ORC 1.4.0 already has many fixes, and we can depend on the ORC community's effort going forward.
- Usability: Users can use ORC data sources without the Hive module (`-Phive`); a minimal usage sketch follows at the end of this mail.
- Maintainability: Reduce the Hive dependency and eventually remove some old legacy code from the `sql/hive` module.

As a first step, I made a PR adding a new ORC data source into the `sql/core` module:

https://github.com/apache/spark/pull/17924 (+3,691 lines, -0)

Could you give some opinions on this approach?

Bests,
Dongjoon.
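For the usability point above, here is a minimal sketch of what reading and writing ORC through the standard `DataFrameReader`/`DataFrameWriter` API looks like. The path and column name are hypothetical, and this assumes the new `sql/core` data source keeps the existing `orc` short name; the point is that with the proposed PR this would work without building Spark with `-Phive`:

```scala
import org.apache.spark.sql.SparkSession

object OrcExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("OrcExample")
      .master("local[*]")
      .getOrCreate()

    // Write a small single-int-column dataset, mirroring the benchmark's schema.
    // The output path below is only an example.
    spark.range(0, 1000000).toDF("id")
      .write.mode("overwrite").orc("/tmp/single_int_column")

    // Read it back through the ORC data source. With the proposed reader, this
    // scan is expected to run through ColumnarBatch-based vectorized execution.
    val df = spark.read.orc("/tmp/single_int_column")
    df.selectExpr("sum(id)").show()

    spark.stop()
  }
}
```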