Hi, All.
As a continuation of SPARK-20682 (Support a new faster ORC data source based on
Apache ORC), I would like to suggest making the default ORCFileFormat
configurable between sql/hive and sql/core for the following usages (see the
sketch after the list):
spark.read.orc(...)
spark.write.orc(...)
CREATE TABLE t
USING ORC
...
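For illustration, here is a minimal sketch of how such a switch could look from
an application, assuming a SQL conf named `spark.sql.orc.impl` with values
"native" (sql/core) and "hive" (sql/hive); the exact key and values are defined
in the PR and may differ.

  // Minimal sketch, assuming a conf key `spark.sql.orc.impl`; see the PR for the real name.
  import org.apache.spark.sql.SparkSession

  object OrcImplSwitchExample {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder()
        .appName("orc-impl-switch")
        .master("local[*]")
        .config("spark.sql.orc.impl", "native") // or "hive" for the current default
        .getOrCreate()

      // All three usages above pick up the configured implementation.
      spark.range(0, 1000).write.mode("overwrite").orc("/tmp/orc_demo") // hypothetical path
      spark.read.orc("/tmp/orc_demo").show(5)
      spark.sql("CREATE TABLE t USING ORC LOCATION '/tmp/orc_demo'")
    }
  }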
It's filed as SPARK-20728 and I made a PR for that, too.
In the new PR,
- You can test not only the PR but also your apps more easily with that
option.
- To help with reviews, the PR includes updated benchmark results for both
ORCReadBenchmark and ParquetReadBenchmark.
Since the previous PR is still in progress, the new PR inevitably contains some
of the previous PR's changes. I'll remove the duplication later in any case.
Any opinions on Spark ORC improvements are welcome!
Thanks,
Dongjoon.
________________________________
From: Dong Joon Hyun <[email protected]>
Sent: Friday, May 12, 2017 10:49 AM
To: [email protected]
Subject: Re: Faster Spark on ORC with Apache ORC
Hi,
I have been wondering how much more Apache Spark 2.2.0 has improved.
This is the prior record from the source code.
Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz
SQL Single Int Column Scan:         Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
SQL Parquet Vectorized                    215 /  262         73.0          13.7       1.0X
SQL Parquet MR                           1946 / 2083          8.1         123.7       0.1X
So, I got a similar (but slower) machine and ran ParquetReadBenchmark on it.
Apache Spark seems to have improved significantly again, but strangely, the MR
version has improved even more in general.
Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Mac OS X 10.12.4
Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz
SQL Single Int Column Scan:         Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
SQL Parquet Vectorized                    102 /  123        153.7           6.5       1.0X
SQL Parquet MR                            409 /  436         38.5          26.0       0.3X
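(For anyone who wants to reproduce numbers like the above: the real harness is
ParquetReadBenchmark in sql/core. Below is only a rough sketch of its
measurement pattern, using the Spark 2.2-era internal
org.apache.spark.util.Benchmark utility; the path and row count are made up.)

  // Rough sketch of the read-benchmark pattern; not the actual ParquetReadBenchmark code.
  package org.apache.spark.sql.execution.benchmark // Benchmark is private[spark]

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.util.Benchmark

  object SingleIntColumnScanSketch {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder().master("local[1]")
        .appName("parquet-scan-bench").getOrCreate()
      val values = 1024 * 1024 * 15 // made-up row count
      spark.range(values).createOrReplaceTempView("t1")
      spark.sql("SELECT CAST(id AS INT) AS id FROM t1")
        .write.mode("overwrite").parquet("/tmp/parquet_bench") // hypothetical path
      spark.read.parquet("/tmp/parquet_bench").createOrReplaceTempView("tempTable")

      val benchmark = new Benchmark("SQL Single Int Column Scan", values)
      benchmark.addCase("SQL Parquet Vectorized") { _ =>
        spark.sql("SELECT sum(id) FROM tempTable").collect()
      }
      benchmark.run() // prints Best/Avg Time(ms), Rate(M/s), Per Row(ns), Relative
    }
  }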
For ORC, my PR ( https://github.com/apache/spark/pull/17924 ) looks like the
following.
Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Mac OS X 10.12.4
Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz
SQL Single Int Column Scan:         Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
SQL ORC Vectorized                        147 /  153        107.3           9.3       1.0X
SQL ORC MR                                338 /  369         46.5          21.5       0.4X
HIVE ORC MR                               408 /  424         38.6          25.9       0.4X
Given that this is an initial PR without optimizations, the vectorized ORC
reader seems to be catching up quickly.
Bests,
Dongjoon.
From: Dongjoon Hyun <[email protected]>
Date: Tuesday, May 9, 2017 at 6:15 PM
To: "[email protected]" <[email protected]>
Subject: Faster Spark on ORC with Apache ORC
Hi, All.
Apache Spark has always been a fast and general engine, and
since SPARK-2883, Spark has supported Apache ORC inside the `sql/hive` module
with a Hive dependency.
With Apache ORC 1.4.0 (released yesterday), we can make Spark on ORC faster and
get the following benefits.
- Speed: Use Spark `ColumnarBatch` and ORC `RowBatch` together, which means
full vectorization support (see the sketch after this list).
- Stability: Apache ORC 1.4.0 already has many fixes, and we can depend on the
ORC community's efforts in the future.
- Usability: Users can use `ORC` data sources without the hive module (-Phive).
- Maintainability: Reduce the Hive dependency and eventually remove some old
legacy code from the `sql/hive` module.
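To make the `Speed` item concrete, here is a minimal sketch of the ORC 1.4 core
batch-reading API that the new data source builds on. The file path and schema
(a single BIGINT column) are assumptions for illustration; the actual
integration wraps these batches in Spark's `ColumnarBatch` rather than printing
values.

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.Path
  import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector
  import org.apache.orc.OrcFile

  object OrcRowBatchSketch {
    def main(args: Array[String]): Unit = {
      // Hypothetical ORC file containing a single BIGINT column.
      val reader = OrcFile.createReader(
        new Path("/tmp/orc_demo/part-00000.orc"),
        OrcFile.readerOptions(new Configuration()))
      val rows = reader.rows()                      // RecordReader over the file
      val batch = reader.getSchema.createRowBatch() // ORC VectorizedRowBatch
      while (rows.nextBatch(batch)) {
        val col = batch.cols(0).asInstanceOf[LongColumnVector]
        var i = 0
        while (i < batch.size) {
          println(col.vector(i)) // the data source exposes these via ColumnarBatch instead
          i += 1
        }
      }
      rows.close()
    }
  }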
As a first step, I made a PR adding a new ORC data source into `sql/core`
module.
https://github.com/apache/spark/pull/17924 (+ 3,691 lines, -0)
Could you give some opinions on this approach?
Bests,
Dongjoon.