[ https://issues.apache.org/jira/browse/ORC-817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dongjoon Hyun updated ORC-817:
------------------------------
Affects Version/s: 1.8.0
(was: 1.7.0)
> Replace aircompressor ZStandard compression with zstd-jni
> ---------------------------------------------------------
>
> Key: ORC-817
> URL: https://issues.apache.org/jira/browse/ORC-817
> Project: ORC
> Issue Type: Improvement
> Components: Java
> Affects Versions: 1.8.0
> Reporter: David Christle
> Priority: Major
>
> This issue tracks the replacement of the {{aircompressor}} dependency for
> ZStandard compression with {{zstd-jni}}.
> ORC's Java ZStandard compression codec currently uses the {{aircompressor}}
> dependency. This implementation is in pure Java, which provides all the
> niceties of not using an additional language, but over time, it has become
> less ideal:
> * Several other projects in the big data processing ecosystem, such as
> {{spark}}, {{parquet}}, and {{avro}}, rely on {{zstd-jni}}, which is a
> Java Native Interface wrapper over the core {{zstd}} C library. Relying on
> the same dependency as other projects in our realm will let us track the same
> improvements and stay on the ZStandard implementation that the community has
> standardized on.
> * ORC C++ uses the {{zstd}} library directly, while ORC Java relies on
> {{aircompressor}}. Since these implementations do not have feature parity, it is
> theoretically possible to modify ORC C++ to produce a file that ORC Java
> cannot read. Maintaining compatibility between C++ and Java ORC means limiting
> the available features to those supported by both, which is restrictive when
> relying on {{aircompressor}}. It is also conceivable that unintended
> incompatibilities between implementations could silently arise.
> * {{aircompressor}} implements a very limited set of ZStandard compression
> modes. In
> [https://github.com/airlift/aircompressor/blob/495bae80ac7487d2efa1bba437d04e8a2a42bb7b/src/main/java/io/airlift/compress/zstd/CompressionParameters.java#L143]
> it can be seen that only the {{DoubleFastBlockCompressor}} strategy of
> ZStandard (out of the eight possible strategies) is actually implemented.
> This is a fast-speed/lower-compression-ratio strategy, which means it's
> suitable for things like shuffle data, but the higher-compression-ratio/
> slower-speed levels, which could be used to store
> "write-once-read-many" or backup data in ORC with high compression ratios,
> aren't possible with {{aircompressor}}.
> * {{aircompressor}} currently suffers from a bug, originally discovered in
> the {{presto}} community, that prevents ORC from upgrading to the most recent
> {{aircompressor}} version, lest we introduce the same bug into ORC:
> [https://github.com/airlift/aircompressor/issues/122]. Moving to {{zstd-jni}}
> could let {{presto}} move to {{zstd-jni}} as well.
> * Besides bug and performance fixes, {{zstd-jni}} supports newer
> functionality like {{--long}} mode that {{aircompressor}} doesn't. This mode
> uses longer distance windows to achieve materially higher compression ratios
> at the same speeds as earlier ZStandard versions, and has been available for
> more than two years: [https://github.com/facebook/zstd/releases/tag/v1.3.2]