Hi, all.
Apache Spark has always been a fast and general engine, and
since SPARK-2883, Spark has supported Apache ORC inside the `sql/hive` module
with a Hive dependency.
With Apache ORC 1.4.0 (released yesterday), we can make Spark on ORC faster and
gain several benefits.
- Speed: Use Spark `ColumnarBatch` and ORC `RowBatch` together, which
means full vectorization support.
- Stability: Apache ORC 1.4.0 already includes many fixes, and we can rely on
the ORC community's effort in the future.
- Usability: Users can read and write `ORC` data sources without the Hive
module (`-Phive`).
- Maintainability: Reduce the Hive dependency and eventually remove old
legacy code from the `sql/hive` module.
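To illustrate the usability point, here is a minimal sketch of what reading and writing ORC would look like through the standard DataFrame API once the data source lives in `sql/core`; the output path is a placeholder, and the exact behavior (e.g. which reads are vectorized) depends on the final implementation in the PR:

```scala
import org.apache.spark.sql.SparkSession

// Build a local SparkSession; with the ORC data source in sql/core,
// no -Phive build profile or Hive classes would be required.
val spark = SparkSession.builder()
  .appName("orc-example")
  .master("local[*]")
  .getOrCreate()

// Write a small DataFrame as ORC and read it back using the
// standard DataFrameReader/DataFrameWriter API.
val df = spark.range(0, 10).toDF("id")
df.write.mode("overwrite").orc("/tmp/orc_example")  // placeholder path

val loaded = spark.read.orc("/tmp/orc_example")
loaded.show()
```

Nothing in the user-facing API changes here; the difference is in which module provides the `orc` format and whether the read path is fully vectorized underneath.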
As a first step, I made a PR that adds a new ORC data source to the `sql/core`
module.
https://github.com/apache/spark/pull/17924 (+ 3,691 lines, -0)
Could you give your opinions on this approach?
Bests,
Dongjoon.