Hi all,

I am a heavy user of Spark at LinkedIn, and I am excited about the ZStandard compression option recently incorporated into ORC 1.6. I would love to explore using it for storing/querying large (>10 TB) tables in my own disk-I/O-intensive workloads, and other users and companies may be interested in adopting ZStandard more broadly, since it appears to offer faster compression at higher compression ratios, with better multi-threaded support, than zlib/Snappy. At scale, improvements of even ~10% on disk and/or compute, hopefully obtained just by setting the “orc.compress” option to a different value, could translate into palpable capacity/cost gains cluster-wide without requiring broad engineering migrations. See a somewhat recent Facebook Engineering blog post on the topic for their reported experiences: https://engineering.fb.com/core-data/zstandard/
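
For concreteness, here is a rough sketch of what I would hope the user-facing change looks like once Spark ships with an ORC version that supports Zstd. I am assuming the codec name follows the existing pattern (e.g. “zstd”, like “snappy”/“zlib” today); the paths and app name below are purely illustrative:

    // Hypothetical usage, assuming a Spark build with ORC 1.6+ and Zstd support.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("zstd-orc-example").getOrCreate()

    // Read an existing zlib-compressed table and rewrite it with Zstd.
    val df = spark.read.orc("/data/events_zlib")
    df.write
      .option("orc.compress", "ZSTD")     // or .option("compression", "zstd")
      .orc("/data/events_zstd")

    // Alternatively, set a session-wide default for all ORC writes:
    // spark.conf.set("spark.sql.orc.compression.codec", "zstd")

If it really does come down to flipping that one option (or the table property "orc.compress"="ZSTD" for Hive-created tables), adoption should require essentially no application changes.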
Do we know if ORC 1.6.x will make the cut for Spark 3.0? A recent PR (https://github.com/apache/spark/pull/26669) updated ORC to 1.5.8, but I don’t have a good understanding of how difficult incorporating ORC 1.6.x into Spark will be. For instance, in the PRs that enabled Java Zstd in ORC (https://github.com/apache/orc/pull/306 & https://github.com/apache/orc/pull/412), there was additional work/discussion around Hadoop shims to maintain compatibility with older Hadoop versions (e.g. 2.7), and around using aircompressor (a library containing pure-Java implementations of various compression codecs) so that a dependence on Hadoop 2.9 is not required. These may be non-issues, but I wanted to kindle discussion on whether this can make the cut for 3.0, since I imagine it’s a major upgrade many users will focus on migrating to once released.

Kind regards,
David Christle