Re: Use Apache ORC in Apache Spark 2.3

Reynold Xin Thu, 10 Aug 2017 15:24:25 -0700

Do you not use the catalog?


On Thu, Aug 10, 2017 at 3:22 PM, Andrew Ash <and...@andrewash.com> wrote:

> I would support moving ORC from sql/hive -> sql/core because it brings me
> one step closer to eliminating Hive from my Spark distribution by removing
> -Phive at build time.
>
> On Thu, Aug 10, 2017 at 9:48 AM, Dong Joon Hyun <dh...@hortonworks.com>
> wrote:
>
>> Thank you again for coming and reviewing this PR.
>>
>>
>>
>> So far, we have discussed the following.
>>
>>
>>
>> 1. `Why are we adding this to core? Why not just the hive module?` (@rxin)
>>
>>    - `sql/core` module gives more benefit than `sql/hive`.
>>
>>    - Apache ORC library (`no-hive` version) is a general and reasonably
>> small library designed for non-hive apps.
>>
>>
>>
>> 2. `Can we add smaller amount of new code to use this, too?` (@kiszk)
>>
>>    - The previous #17980, #17924, and #17943 are the complete examples
>> containing this PR.
>>
>>    - This PR is focusing on dependency only.
>>
>>
>>
>> 3. `Why don't we then create a separate orc module? Just copy a few of
>> the files over?` (@rxin)
>>
>>    - Apache ORC library is the same as most other data sources (CSV,
>> JDBC, JSON, PARQUET, TEXT), which live inside `sql/core`.
>>
>>    - It's better to use ORC as a library instead of copying ORC files,
>> because the Apache ORC shaded jar has many files. We should rely on the
>> Apache ORC community's effort until an unavoidable reason for copying occurs.
>>
>>
>>
>> 4. `I do worry in the future whether ORC would bring in a lot more jars`
>> (@rxin)
>>
>>    - The ORC core library's dependency tree is aggressively kept as small
>> as possible. I've gone through and excluded unnecessary jars from our
>> dependencies. I also kick back pull requests that add unnecessary new
>> dependencies. (@omalley)
>>
>>
>>
>> 5. `In the long term, Spark should move to using only the vectorized
>> reader in ORC's core` (@omalley)
>>
>> - Of course.
>>
>>
>>
>> I’ve been waiting for new comments and discussion since last week.
>>
>> Apparently, there have been no further comments this week except the last
>> comment (5) from Owen.
>>
>>
>>
>> Please give your opinion if you think we need some change on the current
>> PR (as-is).
>>
>> FYI, there is one LGTM on the PR (as-is) and no -1 so far.
>>
>>
>>
>> Thank you again for supporting new ORC improvement in Apache Spark.
>>
>>
>>
>> Bests,
>>
>> Dongjoon.
>>
>>
>>
>>
>>
>> *From: *Dong Joon Hyun <dh...@hortonworks.com>
>> *Date: *Friday, August 4, 2017 at 8:05 AM
>> *To: *"dev@spark.apache.org" <dev@spark.apache.org>
>> *Cc: *Apache Spark PMC <priv...@spark.apache.org>
>> *Subject: *Use Apache ORC in Apache Spark 2.3
>>
>>
>>
>> Hi, All.
>>
>>
>>
>> Apache Spark has always been a fast and general engine, and
>>
>> it has supported Apache ORC inside the `sql/hive` module, with a Hive
>> dependency, since Spark 1.4.X (SPARK-2883).
>>
>> However, there are many open issues about `Feature parity for ORC with
>> Parquet (SPARK-20901)` as of today.
>>
>>
>>
>> With new Apache ORC 1.4 (released 8th May), Apache Spark is able to get
>> the following benefits.
>>
>>
>>
>>     - Usability:
>>
>>         * Users can use `ORC` data sources without the hive module (-Phive),
>> as with the `Parquet` format.
>>
>>
>>
>>     - Stability & Maintainability:
>>
>>         * ORC 1.4 already has many fixes.
>>
>>         * In the future, Spark can upgrade ORC library independently from
>> Hive
>>            (similar to Parquet library, too)
>>
>>         * Eventually, reduce the dependency on old Hive 1.2.1.
>>
>>
>>
>>     - Speed:
>>
>>         * Last but not least, Spark can use both Spark `ColumnarBatch`
>> and ORC `RowBatch` together
>>
>>           which means full vectorization support.
>>
>>
>>
>> First of all, I'd love to improve Apache Spark in the following steps in
>> the time frame of Spark 2.3.
>>
>>
>>
>>     - SPARK-21422: Depend on Apache ORC 1.4.0
>>
>>     - SPARK-20682: Add a new faster ORC data source based on Apache ORC
>>
>>     - SPARK-20728: Make ORCFileFormat configurable between sql/hive and
>> sql/core
>>
>>     - SPARK-16060: Vectorized Orc Reader
>>
>>
>>
>> I’ve made the above PRs since 9th May, the day after the Apache ORC 1.4 release,
>>
>> but the PRs seem to need more attention from the PMC since this is an
>> important change.
>>
>> Since the discussion on the Apache Spark 2.3 cadence has already started this
>> week,
>>
>> I thought it’s the best time to ask you about this.
>>
>>
>>
>> Could any of you help me move the ORC improvement forward in the Apache
>> Spark community?
>>
>>
>>
>> Please visit the minimal PR and JIRA issue as a starter.
>>
>>
>>
>>    - https://github.com/apache/spark/pull/18640
>>    - https://issues.apache.org/jira/browse/SPARK-21422
>>
>>
>>
>> Thank you in advance.
>>
>>
>>
>> Bests,
>>
>> Dongjoon Hyun.
>>
>
>

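A side note on the "Speed" item in the quoted proposal, which pairs Spark's
`ColumnarBatch` with ORC's `RowBatch`: the general idea behind vectorized
reading can be shown with a toy sketch. The Python below is only an
illustration under stated assumptions. It does not use any Spark or ORC API,
and every name in it (`sum_row_at_a_time`, `to_column_batches`,
`sum_vectorized`) is hypothetical.

```python
from typing import Dict, List

Row = Dict[str, int]            # one record: column name -> value
Batch = Dict[str, List[int]]    # one batch: column name -> list of values


def sum_row_at_a_time(rows: List[Row], column: str) -> int:
    """Row-oriented scan: visit every record and pull out one field."""
    total = 0
    for row in rows:
        total += row[column]
    return total


def to_column_batches(rows: List[Row], batch_size: int) -> List[Batch]:
    """Regroup rows into fixed-size batches that store one list per column."""
    batches: List[Batch] = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        batches.append({name: [r[name] for r in chunk] for name in chunk[0]})
    return batches


def sum_vectorized(batches: List[Batch], column: str) -> int:
    """Batch-oriented scan: consume a whole column slice per batch."""
    return sum(sum(batch[column]) for batch in batches)


# Toy data: ten rows with two integer columns.
rows = [{"id": i, "value": i * 2} for i in range(10)]
batches = to_column_batches(rows, batch_size=4)

row_total = sum_row_at_a_time(rows, "value")
batch_total = sum_vectorized(batches, "value")
```

Both paths compute the same total. The batched path hands each operation a
whole column slice at a time, which is the access pattern that a vectorized
reader is built around.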