[ 
https://issues.apache.org/jira/browse/ORC-817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

apache.org updated ORC-817:
---------------------------
    Description: 
This issue tracks the replacement of the {{aircompressor}} dependency for 
ZStandard compression with {{zstd-jni}}.

ORC's Java ZStandard compression codec currently uses the {{aircompressor}} 
dependency. This implementation is in pure Java, which provides all the 
niceties of not using an additional language, but over time, it has become less 
ideal:
 * Multiple other projects in the big data processing ecosystem, such as 
{{spark}}, {{parquet}}, and {{avro}}, rely on {{zstd-jni}}, which is a Java 
Native Interface wrapper over the core {{zstd}} C library. Relying on the same 
dependency as other projects in our realm will let us track the same 
improvements and stay on a ZStandard implementation blessed by the community.
 * ORC C++ uses the {{zstd}} library directly, while ORC Java relies on 
{{aircompressor}}. Since these implementations do not have feature parity, it is 
theoretically possible to modify ORC C++ to produce a file that ORC Java cannot 
read. Maintaining compatibility between C++ and Java ORC means restricting the 
available features to those supported by both, which is limiting when relying 
on {{aircompressor}}. It is also conceivable that unintended incompatibilities 
between implementations could silently arise.
 * {{aircompressor}} implements a very limited set of ZStandard compression 
modes. In 
[https://github.com/airlift/aircompressor/blob/495bae80ac7487d2efa1bba437d04e8a2a42bb7b/src/main/java/io/airlift/compress/zstd/CompressionParameters.java#L143]
 it can be seen that only the {{DoubleFastBlockCompressor}} strategy of 
ZStandard (out of the eight possible strategies) is actually implemented. This 
is a fast-speed/lower-compression-ratio strategy, which means it's suitable for 
things like shuffle data, but that higher-compression-ratio/slower-speed 
levels, which could be used to store "write-once-read-many" or backup data in 
ORC with high compression ratios, aren't possible with {{aircompressor}}.
 * {{aircompressor}} currently suffers from a bug, originally discovered in the 
{{presto}} community, that prevents ORC from upgrading to the most recent 
{{aircompressor}} version, lest we introduce the same bug into ORC: 
[https://github.com/airlift/aircompressor/issues/122]. Moving to {{zstd-jni}} 
could let {{presto}} move to {{zstd-jni}} as well.
 * Besides bug and performance fixes, {{zstd-jni}} supports newer functionality 
that {{aircompressor}} doesn't, such as {{--long}} mode. This mode uses longer 
distance windows to achieve materially higher compression ratios at the same 
speeds as earlier ZStandard versions, and has been available for more than two 
years: [https://github.com/facebook/zstd/releases/tag/v1.3.2]

  was:
This issue tracks the replacement of the `aircompressor` dependency for 
ZStandard compression with `zstd-jni`.

ORC's Java ZStandard compression codec currently uses the `aircompressor` 
dependency. This implementation is in pure Java, which provides all the 
niceties of not using an additional language, but over time, it has become less 
ideal:
 * Multiple other projects in the big data processing ecosystem like `spark`, 
`parquet`, and `avro`, all rely on `zstd-jni`, which is a Java Native Interface 
wrapper over the core `zstd` C++ library. Relying on the same dependency as 
other projects in our realm will let us track the same improvements and 
maintain the aesthetic of a ZStandard implementation blessed by the community.
 * ORC C++ uses the `zstd` library directly, while ORC Java relies on 
`aircompressor`. Since these versions do not have feature parity, it is 
theoretically possible to modify ORC C++ to produce a file that ORC Java cannot 
read. Maintaining compatibility between C++ and Java ORC means keeping the 
available features to those supported by both, which is limiting when relying 
on `aircompressor`. It is also conceivable that unintended incompatibilities 
between implementations could silently arise.
 * `aircompressor` implements a very limited set of ZStandard compression 
modes. In 
[https://github.com/airlift/aircompressor/blob/495bae80ac7487d2efa1bba437d04e8a2a42bb7b/src/main/java/io/airlift/compress/zstd/CompressionParameters.java#L143]
 it can be seen that only the `DoubleFastBlockCompressor` strategy of ZStandard 
(out of the eight possible strategies) is actually implemented. This is a 
fast-speed/lower-compression-ratio strategy, which means it's suitable for 
things like shuffle data, but that that higher compression ratio/slower speed 
levels, which could be used to store "write-once-read-many" or backup data in 
ORC with high compression ratios, aren't possible with `aircompressor`.
 * `aircompressor` currently suffers from a bug, originally discovered in the 
`presto` community, that prevents ORC from upgrading to the most recent 
`aircompressor` version, lest we introduce the same bug into ORC: 
[https://github.com/airlift/aircompressor/issues/122] Moving to `zstd-jni` 
could let `presto-orc` to move to `zstd-jni` as well.
 * Besides bug and performance fixes, `zstd-jni` supports newer functionality 
like `–long` mode that `aircompressor` doesn't. This mode uses longer distance 
windows to achieve materially higher compression ratios at the same speeds as 
earlier ZStandard versions, and has been available for more than two years: 
[https://github.com/facebook/zstd/releases/tag/v1.3.2] 


> Replace aircompressor ZStandard compression with zstd-jni
> ---------------------------------------------------------
>
>                 Key: ORC-817
>                 URL: https://issues.apache.org/jira/browse/ORC-817
>             Project: ORC
>          Issue Type: Improvement
>          Components: Java
>    Affects Versions: 1.7.0
>            Reporter: apache.org
>            Priority: Major
>             Fix For: 1.7.0



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
