Hi, David.

Thank you for sharing your opinion.
I'm also a supporter of ZStandard.

Apache Spark 3.0 starts to take advantage of ZStd in several places
(see the rough sketch after this list):

   1) Switch the default codec for MapOutputStatus from GZip to ZStd.
   2) Add spark.eventLog.compression.codec to allow ZStd.
   3) Use Parquet+ZStd for data storage off the shelf
       (with the Hadoop 3.2 pre-built distribution).
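
For illustration, here is a rough Scala sketch of how 2) and 3) show up
to users. The config keys and option values below are to the best of my
knowledge, so please double-check them against the 3.0 docs.

    import org.apache.spark.sql.SparkSession

    // Sketch only: item 2) above. These would normally live in
    // spark-defaults.conf rather than in application code.
    val spark = SparkSession.builder()
      .appName("zstd-sketch")
      .config("spark.eventLog.compress", "true")
      .config("spark.eventLog.compression.codec", "zstd")
      .getOrCreate()

    // Item 3) above: write Parquet with ZStd via the per-write option
    // (spark.sql.parquet.compression.codec=zstd also works session-wide).
    val df = spark.range(0, 1000000L).toDF("id")
    df.write.option("compression", "zstd").parquet("/tmp/zstd_parquet_sketch")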

So, the last big missing piece is ORC, right?

As a PMC member of both Apache Spark and Apache ORC,
I've been trying to reduce the gap between the two communities
in order to maximize the synergy.

Historically,

   1) In Apache Spark 2.3.0, Spark started to depend on Apache ORC 1.4.0
(SPARK-21422).
   2) In Apache Spark 2.4.0, Apache ORC 1.5.2 became the default ORC
library (SPARK-23456).
   3) In Apache Spark 3.0.0, Spark embraced the breaking change of
Apache ORC 1.5.6 (SPARK-28208).
      And, Apache Spark 3.0.0-preview2 caught up to ORC 1.5.8, while
Spark 2.4.5 and 2.4.6 will stay on 1.5.5.

However, Apache ORC 1.5.9 RC0 also had another breaking change. Although we
minimized the regression in 1.5.9 RC1, we still need to adapt to the
following changes (please see the ORC 1.5.9 RC0/RC1 vote emails for the
details); a rough build-file sketch follows the list.

   - Upgrade hive-storage-api from 2.6.0 to 2.7.1
   - Add new dependency `threeten-extra-1.5.0.jar`
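
For anyone pulling ORC directly, the dependency delta looks roughly like
the following build.sbt sketch (the Maven coordinates are from memory,
so please verify them on Maven Central):

    libraryDependencies ++= Seq(
      "org.apache.orc"  % "orc-core"         % "1.5.9",
      "org.apache.orc"  % "orc-mapreduce"    % "1.5.9",
      "org.apache.hive" % "hive-storage-api" % "2.7.1",  // was 2.6.0
      "org.threeten"    % "threeten-extra"   % "1.5.0"   // new dependency
    )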

The above breaking changes are a subset of those in ORC 1.6.x. And, we
need to validate any potential performance regressions in 1.6.x, too. I hope
we can use Apache ORC 1.5.9 as a stepping stone to reach Apache ORC 1.6.x.

I'll create a PR to upgrade to Apache ORC 1.5.9 as soon as possible, but the
Spark community will decide whether or not it makes it into 3.0.0.

In short, given the circumstances, Apache Spark 3.1.0 will be a safer
target for Apache ORC 1.6.x adoption. Per our release cadence, Spark 3.1.0
will arrive six months after Spark 3.0.0. At that time, we can focus on ORC
improvements with more real-world references. (For now, Apache ORC 1.6 is
not used by Apache Hive, either.)

Bests,
Dongjoon.


On Tue, Jan 28, 2020 at 12:41 PM David Christle
<dchris...@linkedin.com.invalid> wrote:

> Hi all,
>
>
>
> I am a heavy user of Spark at LinkedIn, and am excited about the ZStandard
> compression option recently incorporated into ORC 1.6. I would love to
> explore using it for storing/querying of large (>10 TB) tables for my own
> disk I/O intensive workloads, and other users & companies may be interested
> in adopting ZStandard more broadly, since it seems to offer faster
> compression speeds at higher compression ratios with better multi-threaded
> support than zlib/Snappy. At scale, improvements of even ~10% on disk
> and/or compute, hopefully just from setting the “orc.compress” flag to a
> different value, could translate into palpable gains in capacity/cost
> cluster wide without requiring broad engineering migrations. See a somewhat
> recent FB Engineering blog post on the topic for their reported
> experiences: https://engineering.fb.com/core-data/zstandard/
>
>
>
> Do we know if ORC 1.6.x will make the cut for Spark 3.0?
>
>
>
> A recent PR (https://github.com/apache/spark/pull/26669) updated ORC to
> 1.5.8, but I don’t have a good understanding of how difficult incorporating
> ORC 1.6.x into Spark will be. For instance, in the PRs for enabling Java
> Zstd in ORC (https://github.com/apache/orc/pull/306 &
> https://github.com/apache/orc/pull/412), some additional work/discussion
> around Hadoop shims occurred to maintain compatibility across different
> versions of Hadoop (e.g. 2.7) and aircompressor (a library containing Java
> implementations of various compression codecs, so that dependence on Hadoop
> 2.9 is not required). Again, these may be non-issues, but I wanted to
> kindle discussion around whether this can make the cut for 3.0, since I
> imagine it’s a major upgrade many users will focus on migrating to once
> released.
>
>
>
> Kind regards,
>
> David Christle
>
