Hi, all.
Apache Spark has always been a fast and general engine, and
since SPARK-2883, Spark has supported Apache ORC inside the `sql/hive` module
with a Hive dependency.
With Apache ORC 1.4.0 (released yesterday), we can make Spark on ORC faster and
gain several benefits.
- Speed: Use Spark `ColumnarBatch` and ORC `RowBatch` together, which
means full vectorization support.
- Stability: Apache ORC 1.4.0 already includes many fixes, and we can rely on
the ORC community's effort in the future.
- Usability: Users can read and write `ORC` data sources without the Hive
module (`-Phive`).
- Maintainability: Reduce the Hive dependency and eventually remove old
legacy code from the `sql/hive` module.
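To illustrate the usability point, here is a minimal sketch of what reading and writing ORC would look like through the standard DataFrame API once the data source lives in `sql/core`; the output path is a placeholder, and the exact behavior (e.g. which reads are vectorized) depends on the final implementation in the PR:

```scala
import org.apache.spark.sql.SparkSession

// Build a local SparkSession; with the ORC data source in sql/core,
// no -Phive build profile or Hive classes would be required.
val spark = SparkSession.builder()
  .appName("orc-example")
  .master("local[*]")
  .getOrCreate()

// Write a small DataFrame as ORC and read it back using the
// standard DataFrameReader/DataFrameWriter API.
val df = spark.range(0, 10).toDF("id")
df.write.mode("overwrite").orc("/tmp/orc_example")  // placeholder path

val loaded = spark.read.orc("/tmp/orc_example")
loaded.show()
```

Nothing in the user-facing API changes here; the difference is in which module provides the `orc` format and whether the read path is fully vectorized underneath.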
As a first step, I made a PR that adds a new ORC data source to the `sql/core`
module.
https://github.com/apache/spark/pull/17924 (+ 3,691 lines, -0)
Could you give your opinions on this approach?
Bests,
Dongjoon.