Hi, All.
As a continuation of SPARK-20682 (Support a new faster ORC data source based on
Apache ORC), I would like to suggest making the default ORCFileFormat
configurable between sql/hive and sql/core for the following usages (see the
sketch after the list):
spark.read.orc(...)
spark.write.orc(...)
CREATE TABLE t
USING ORC
...
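For illustration, here is a minimal sketch of how such a switch could look from
an application, assuming a SQL conf named `spark.sql.orc.impl` with values
"native" (sql/core) and "hive" (sql/hive); the exact key and values are defined
in the PR and may differ.

  // Minimal sketch, assuming a conf key `spark.sql.orc.impl`; see the PR for the real name.
  import org.apache.spark.sql.SparkSession

  object OrcImplSwitchExample {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder()
        .appName("orc-impl-switch")
        .master("local[*]")
        .config("spark.sql.orc.impl", "native") // or "hive" for the current default
        .getOrCreate()

      // All three usages above pick up the configured implementation.
      spark.range(0, 1000).write.mode("overwrite").orc("/tmp/orc_demo") // hypothetical path
      spark.read.orc("/tmp/orc_demo").show(5)
      spark.sql("CREATE TABLE t USING ORC LOCATION '/tmp/orc_demo'")
    }
  }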
It's filed as SPARK-20728 and I made a PR for that, too.
In the new PR,
- You can test not only the PR but also your apps more easily with that
option.
- To help with reviews, the PR includes updated benchmark results for both
ORCReadBenchmark and ParquetReadBenchmark.
Since the previous PR is still in progress, the new PR inevitably contains some
of the previous PR's changes. I'll remove the duplication later in any case.
Any opinions on Spark ORC improvements are welcome!
Thanks,
Dongjoon.
________________________________
From: Dong Joon Hyun <[email protected]>
Sent: Friday, May 12, 2017 10:49 AM
To: [email protected]
Subject: Re: Faster Spark on ORC with Apache ORC
Hi,
I have been wondering how much more Apache Spark 2.2.0 has improved.
This is the prior record from the source code.
Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz
SQL Single Int Column Scan:         Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
SQL Parquet Vectorized                    215 /  262         73.0          13.7       1.0X
SQL Parquet MR                           1946 / 2083          8.1         123.7       0.1X
So, I got a similar (but slower) machine and ran ParquetReadBenchmark on it.
Apache Spark seems to have improved significantly again, but strangely, the MR
version has improved even more in general.
Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Mac OS X 10.12.4
Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz
SQL Single Int Column Scan:         Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
SQL Parquet Vectorized                    102 /  123        153.7           6.5       1.0X
SQL Parquet MR                            409 /  436         38.5          26.0       0.3X
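(For anyone who wants to reproduce numbers like the above: the real harness is
ParquetReadBenchmark in sql/core. Below is only a rough sketch of its
measurement pattern, using the Spark 2.2-era internal
org.apache.spark.util.Benchmark utility; the path and row count are made up.)

  // Rough sketch of the read-benchmark pattern; not the actual ParquetReadBenchmark code.
  package org.apache.spark.sql.execution.benchmark // Benchmark is private[spark]

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.util.Benchmark

  object SingleIntColumnScanSketch {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder().master("local[1]")
        .appName("parquet-scan-bench").getOrCreate()
      val values = 1024 * 1024 * 15 // made-up row count
      spark.range(values).createOrReplaceTempView("t1")
      spark.sql("SELECT CAST(id AS INT) AS id FROM t1")
        .write.mode("overwrite").parquet("/tmp/parquet_bench") // hypothetical path
      spark.read.parquet("/tmp/parquet_bench").createOrReplaceTempView("tempTable")

      val benchmark = new Benchmark("SQL Single Int Column Scan", values)
      benchmark.addCase("SQL Parquet Vectorized") { _ =>
        spark.sql("SELECT sum(id) FROM tempTable").collect()
      }
      benchmark.run() // prints Best/Avg Time(ms), Rate(M/s), Per Row(ns), Relative
    }
  }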
For ORC, my PR ( https://github.com/apache/spark/pull/17924 ) looks like the
following.
Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Mac OS X 10.12.4
Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz
SQL Single Int Column Scan:         Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
SQL ORC Vectorized                        147 /  153        107.3           9.3       1.0X
SQL ORC MR                                338 /  369         46.5          21.5       0.4X
HIVE ORC MR                               408 /  424         38.6          25.9       0.4X
Given that this is an initial PR without optimizations, the vectorized ORC
reader seems to be catching up quickly.
Bests,
Dongjoon.
From: Dongjoon Hyun <[email protected]>
Date: Tuesday, May 9, 2017 at 6:15 PM
To: "[email protected]" <[email protected]>
Subject: Faster Spark on ORC with Apache ORC
Hi, All.
Apache Spark has always been a fast and general engine, and
since SPARK-2883, Spark has supported Apache ORC inside the `sql/hive` module
with a Hive dependency.
With Apache ORC 1.4.0 (released yesterday), we can make Spark on ORC faster and
get the following benefits.
- Speed: Use Spark `ColumnarBatch` and ORC `RowBatch` together, which means
full vectorization support (see the sketch after this list).
- Stability: Apache ORC 1.4.0 already has many fixes, and we can depend on the
ORC community's efforts in the future.
- Usability: Users can use `ORC` data sources without the hive module (-Phive).
- Maintainability: Reduce the Hive dependency and eventually remove some old
legacy code from the `sql/hive` module.
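To make the `Speed` item concrete, here is a minimal sketch of the ORC 1.4 core
batch-reading API that the new data source builds on. The file path and schema
(a single BIGINT column) are assumptions for illustration; the actual
integration wraps these batches in Spark's `ColumnarBatch` rather than printing
values.

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.Path
  import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector
  import org.apache.orc.OrcFile

  object OrcRowBatchSketch {
    def main(args: Array[String]): Unit = {
      // Hypothetical ORC file containing a single BIGINT column.
      val reader = OrcFile.createReader(
        new Path("/tmp/orc_demo/part-00000.orc"),
        OrcFile.readerOptions(new Configuration()))
      val rows = reader.rows()                      // RecordReader over the file
      val batch = reader.getSchema.createRowBatch() // ORC VectorizedRowBatch
      while (rows.nextBatch(batch)) {
        val col = batch.cols(0).asInstanceOf[LongColumnVector]
        var i = 0
        while (i < batch.size) {
          println(col.vector(i)) // the data source exposes these via ColumnarBatch instead
          i += 1
        }
      }
      rows.close()
    }
  }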
As a first step, I made a PR adding a new ORC data source into `sql/core`
module.
https://github.com/apache/spark/pull/17924 (+ 3,691 lines, -0)
Could you give some opinions on this approach?
Bests,
Dongjoon.