Hi all,

I am a heavy user of Spark at LinkedIn, and I am excited about the ZStandard compression option recently incorporated into ORC 1.6. I would love to explore using it for storing/querying large (>10 TB) tables in my own disk-I/O-intensive workloads, and other users and companies may be interested in adopting ZStandard more broadly, since it appears to offer faster compression at higher compression ratios, with better multi-threaded support, than zlib/Snappy. At scale, improvements of even ~10% on disk and/or compute, hopefully obtained just by setting the “orc.compress” option to a different value, could translate into palpable capacity/cost gains cluster-wide without requiring broad engineering migrations. See a somewhat recent Facebook Engineering blog post on the topic for their reported experiences: https://engineering.fb.com/core-data/zstandard/
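
For concreteness, here is a rough sketch of what I would hope the user-facing change looks like once Spark ships with an ORC version that supports Zstd. I am assuming the codec name follows the existing pattern (e.g. “zstd”, like “snappy”/“zlib” today); the paths and app name below are purely illustrative:

    // Hypothetical usage, assuming a Spark build with ORC 1.6+ and Zstd support.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("zstd-orc-example").getOrCreate()

    // Read an existing zlib-compressed table and rewrite it with Zstd.
    val df = spark.read.orc("/data/events_zlib")
    df.write
      .option("orc.compress", "ZSTD")     // or .option("compression", "zstd")
      .orc("/data/events_zstd")

    // Alternatively, set a session-wide default for all ORC writes:
    // spark.conf.set("spark.sql.orc.compression.codec", "zstd")

If it really does come down to flipping that one option (or the table property "orc.compress"="ZSTD" for Hive-created tables), adoption should require essentially no application changes.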
Do we know if ORC 1.6.x will make the cut for Spark 3.0? A recent PR (https://github.com/apache/spark/pull/26669) updated ORC to 1.5.8, but I don’t have a good understanding of how difficult incorporating ORC 1.6.x into Spark will be. For instance, in the PRs that enabled Java Zstd in ORC (https://github.com/apache/orc/pull/306 & https://github.com/apache/orc/pull/412), there was additional work/discussion around Hadoop shims to maintain compatibility with older Hadoop versions (e.g. 2.7), and around using aircompressor (a library containing pure-Java implementations of various compression codecs) so that a dependence on Hadoop 2.9 is not required. These may be non-issues, but I wanted to kindle discussion on whether this can make the cut for 3.0, since I imagine it’s a major upgrade many users will focus on migrating to once released.

Kind regards,
David Christle