[jira] [Commented] (HBASE-26258) Universal gzip, lz4, lzo, snappy, and zstd compression support via aircompressor

2021-09-07 Thread Andrew Kyle Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17411654#comment-17411654
 ] 

Andrew Kyle Purtell commented on HBASE-26258:
-

I will certainly take measurements.

These pure Java implementations are meant to be fallbacks, so even if they don’t 
perform as well as the native codecs, that is fine. That is probably already 
clear, but let me say it to be sure. 

> Universal gzip, lz4, lzo, snappy, and zstd compression support via 
> aircompressor
> 
>
> Key: HBASE-26258
> URL: https://issues.apache.org/jira/browse/HBASE-26258
> Project: HBase
>  Issue Type: Improvement
>  Components: HFile, Operability
>Reporter: Andrew Kyle Purtell
>Assignee: Andrew Kyle Purtell
>Priority: Major
> Fix For: 2.5.0, 3.0.0-alpha-2
>
>
> Some Hadoop compression codecs became more available in recent Hadoop 3.x 
> releases, addressed by HBASE-25940. This is nice but still requires native 
> platform support, which to state the obvious is not available on all 
> platforms and architectures, even if native libraries for some are bundled 
> into jars. 
> Airlift's aircompressor 
> (https://search.maven.org/artifact/io.airlift/aircompressor) is an Apache 2 
> licensed library, for Java 8 and up, available in Maven central, which 
> provides pure Java implementations of desirable compression algorithms gzip, 
> lz4, lzo, snappy, and zstd, and Hadoop compression codecs for same, claiming 
> "_they are typically 300% faster than the JNI wrappers_." 
> (https://github.com/airlift/aircompressor). This library is under active 
> development and has up to date releases because it is used by Trino.
> We have another project that depends on universal availability of SNAPPY. I 
> would like to make this change as a general improvement which also satisfies 
> that requirement. (The as yet unnamed project will be contributed later.) 
> Universal ZSTD support would be a very nice-to-have as well. 
> Proposed changes:
> * Modify Compression.java such that compression codec implementation classes 
> can be specified by configuration. Currently they are hardcoded as strings. 
> * Pull in aircompressor as a 'compile' time dependency so it will be bundled 
> into our build and made available on the server classpath. 
> * Modify Compression.java to fall back to an aircompressor pure Java 
> implementation if schema specifies a compression algorithm, a Hadoop native 
> codec was specified as desired implementation, but the requisite native 
> support is somehow not available. 
> The combination of these changes will provide universal (pure Java) support 
> for these desired and desirable compression codecs while retaining default 
> behavior, which is to load and utilize Hadoop native implementations of same, 
> if native support is available. They will also let you override this default 
> if you wish to chase the claimed benefits of the pure Java alternatives.
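
The proposed changes quoted above (a codec class chosen by configuration, with a fallback to a pure Java implementation) could be sketched roughly as follows. This is not the actual Compression.java change: the `example.compression.<algorithm>.codec` key, the `CodecResolver` class, and the class names used in `main` are all hypothetical, and JDK classes stand in for real Hadoop and aircompressor codecs so the sketch runs without extra dependencies.

```java
import java.util.HashMap;
import java.util.Map;

final class CodecResolver {
    // Hypothetical configuration key pattern; not taken from HBase.
    static String configKey(String algorithm) {
        return "example.compression." + algorithm + ".codec";
    }

    /**
     * Returns the first implementation class that loads: the configured class
     * (or the default if none is configured), then the pure Java fallback.
     */
    static Class<?> resolve(Map<String, String> conf, String algorithm,
                            String defaultClass, String fallbackClass) {
        String preferred = conf.getOrDefault(configKey(algorithm), defaultClass);
        for (String name : new String[] { preferred, fallbackClass }) {
            try {
                return Class.forName(name);
            } catch (ClassNotFoundException | LinkageError e) {
                // Class is missing, or loading it failed (for example an
                // UnsatisfiedLinkError from a native library); try the next one.
            }
        }
        throw new IllegalStateException(
            "No codec implementation available for " + algorithm);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        // The "native" default is a made-up class name that does not exist,
        // so resolution falls through to the stand-in fallback class.
        Class<?> codec = resolve(conf, "snappy",
            "org.example.hypothetical.NativeSnappyCodec",
            "java.util.zip.GZIPOutputStream");
        System.out.println("Resolved codec class: " + codec.getName());
    }
}
```

Setting the configuration key to another class name overrides the default, which mirrors the last point of the proposal: an operator could select the pure Java implementation even when native support is present.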



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-26258) Universal gzip, lz4, lzo, snappy, and zstd compression support via aircompressor

2021-09-07 Thread Duo Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17411643#comment-17411643
 ] 

Duo Zhang commented on HBASE-26258:
---

Yes, I agree that after JIT compilation Java code can match the performance of 
native code. But modern compression implementations usually make use of newer 
SSE and AVX instructions, which we cannot use directly from Java code.

Thank you for offering to do the measurements. I will wait for your report.

Thanks.



[jira] [Commented] (HBASE-26258) Universal gzip, lz4, lzo, snappy, and zstd compression support via aircompressor

2021-09-07 Thread Andrew Kyle Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17411637#comment-17411637
 ] 

Andrew Kyle Purtell commented on HBASE-26258:
-

> Just curious: how could a pure Java compression implementation beat a C 
> implementation? Do we lose a lot in the JNI wrapper?

Yes, JNI is terrible. 

Also, in practice we know that Java code emitted by C2 can come very close to 
or meet native code performance. This happens for a few reasons; two come to 
mind right away. First, the native code may not have been optimized with 
vector instructions or assembly, so C2 can match a C compiler 1:1 (or even do 
better, depending on how smart it is about loop optimizations). Second, 
profiling during the interpreter or C1 compile phases can reveal optimization 
opportunities that a C/C++ compiler cannot assume, because the JVM can count 
execution frequencies at method boundaries and loop headers.
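
Because the JVM only reaches C2-compiled code after that profiling has happened, any measurement comparing codecs needs a warmup phase before timing. The following is a rough sketch of that idea only, not the measurements promised in this thread: it uses the JDK's `Deflater` and `Inflater` as stand-ins for the codecs under discussion, the `WarmupTiming` class and iteration counts are made up for illustration, and a serious comparison would use a benchmark harness such as JMH instead of hand-rolled timing.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

final class WarmupTiming {
    // Compresses the input with the JDK deflate implementation.
    static byte[] compress(byte[] input) {
        Deflater deflater = new Deflater(Deflater.BEST_SPEED);
        try {
            deflater.setInput(input);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            while (!deflater.finished()) {
                int n = deflater.deflate(buf);
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } finally {
            deflater.end();
        }
    }

    // Decompresses data produced by compress().
    static byte[] decompress(byte[] compressed) {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(compressed);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && inflater.needsInput()) {
                    break; // truncated input; stop instead of spinning
                }
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt compressed data", e);
        } finally {
            inflater.end();
        }
    }

    // Times a number of compress calls; the sink keeps the work observable.
    static long timeNanos(byte[] input, int iterations) {
        long sink = 0;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sink += compress(input).length;
        }
        long elapsed = System.nanoTime() - start;
        if (sink == 0) {
            throw new IllegalStateException("unexpected empty output");
        }
        return elapsed;
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 2000; i++) {
            sb.append("row-").append(i % 97).append(":value;");
        }
        byte[] data = sb.toString().getBytes(StandardCharsets.UTF_8);

        long cold = timeNanos(data, 2000); // includes interpreter and C1 phases
        long warm = timeNanos(data, 2000); // more likely to run compiled code
        System.out.println("first pass ns/op:  " + cold / 2000);
        System.out.println("second pass ns/op: " + warm / 2000);
    }
}
```

The first pass includes time spent in the interpreter and in compilation, so only the later pass says anything about steady-state throughput.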



[jira] [Commented] (HBASE-26258) Universal gzip, lz4, lzo, snappy, and zstd compression support via aircompressor

2021-09-07 Thread Duo Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17411628#comment-17411628
 ] 

Duo Zhang commented on HBASE-26258:
---

Just curious: how could a pure Java compression implementation beat a C 
implementation? Do we lose a lot in the JNI wrapper?
