Hi Luke,

We have gone through shevek/lzo-java, but we chose to go with
airflow/aircompressor for the following reasons:

1) shevek/lzo-java is internally using .jni, .c and .h files, hence the GNU
licence, and that would leave us with only choice of putting this as an
option dependency

2) performance of airlift/aircompressor was much better as compared to
shevek/lzo-java in terms of compression ratio and time taken for
compression/decompression

3) airflow/aircompressor is in pure java and is under Apache licence

Therefore, we'd prefer to go with adding the current implementation as
optional. We'd require your inputs on the same as we are unsure on where we
are supposed to keep the required files and how the final directory
structure would like. We have an idea and we'll update the current PR
accordingly.

Please do guide us on this.


Regards,

Amogh Tiwari

On Wed, Dec 18, 2019 at 4:42 AM Luke Cwik <lc...@google.com> wrote:

> Sorry for the long delay (was on vacation).
>
> Using org.apache.hadoop isn't part of the Apache Beam Core module but is a
> dependency for those who depend on the Apache Beam Hadoop module. So I
> don't think swapping the com.facebook.presto.hadoop version for the
> org.apache.hadoop version will address Ismael's concern about including
> hadoop dependencies as part of the core.
>
> I looked at shevek/lzo-java[1] and I think its our best choice since it is:
> * pure Java
> * GPLv3 (would require marking the dependency optional and telling users
> to add it explicitly which we have done in the past as well)
> * small (<100kb)
> * small dependency tree (commons-logging & findbugs annotations if we only
> depend on lzo-core)
> * performance (github page claims 500mb/s compression and 800mb/s
> decompression)
>
> Alternatively we can make the LZO compression an extension module (with
> the facebook dependency) and use a registrar to have it loaded dynamically.
>
> 1: https://github.com/shevek/lzo-java
>
> On Fri, Dec 6, 2019 at 5:09 AM Amogh Tiwari <amo...@google.com> wrote:
>
>> While studying the code, we found that the airlift/ aircompressor library
>> only requires some classes which are also present in apache hadoop common
>> package. Therefore, we are now thinking that if we make changes in the
>> airlift/ aircompressor package, remove the
>> com.facebook.presto.hadoop and use the existing org.apache.hadoop
>> <https://mvnrepository.com/artifact/org.apache.hadoop> package which is
>> already included in beam. This will solve both #2 and #3 as the transitive
>> dependency will be removed and the size will also be reduced by almost
>> ~20mbs.
>>
>> But if we use this approach, we will have to manually change the util
>> whenever any changes are made to the airlift library.
>>
>> On Wed, Dec 4, 2019 at 10:13 PM Luke Cwik <lc...@google.com> wrote:
>>
>>> Going with the Registrar/ServiceLoader route would allow for alternative
>>> providers for the same compression algorithms so if they don't like one
>>> they can always contribute a different one.
>>>
>>> On Wed, Dec 4, 2019 at 8:22 AM Ismaël Mejía <ieme...@gmail.com> wrote:
>>>
>>>> (1) seems not to be the issue because it is Apache licensed.
>>>> (2) and (3) are the big issues, because it requires a provided huge
>>>> uber jar that essentially leaks Hadoop classes into core SDK [1] so it is
>>>> definitely concerning.
>>>>
>>>> We discussed at some point during the PR that added ZStandard support
>>>> about creating some sort of Registrar for compression algorithms [2] but we
>>>> decided to not go ahead because we could achieve that for the zstd case via
>>>> the optional dependencies of commons-compress. Maybe it is time to
>>>> reconsider if such mechanism is worth. For example for users that may not
>>>> care about having the hadoop leakage to be able to use LZO.
>>>>
>>>> Refs.
>>>> [1] https://mvnrepository.com/artifact/io.airlift/aircompressor/0.16
>>>> [2] https://issues.apache.org/jira/browse/BEAM-6422
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Dec 3, 2019 at 7:01 PM Robert Bradshaw <rober...@google.com>
>>>> wrote:
>>>>
>>>>> Is there a way to wrap this up as an optional dependency with multiple
>>>>> possible providers, if there's no good library satisfying all of the
>>>>> conditions (in particular (1))?
>>>>>
>>>>> On Tue, Dec 3, 2019 at 9:47 AM Luke Cwik <lc...@google.com> wrote:
>>>>> >
>>>>> > I was hoping that someone in the community would provide some
>>>>> alternatives since there are quite a few implementations.
>>>>> >
>>>>> > On Tue, Dec 3, 2019 at 8:20 AM Amogh Tiwari <amo...@google.com>
>>>>> wrote:
>>>>> >>
>>>>> >> Hi Luke,
>>>>> >>
>>>>> >> I agree with your thoughts and observations. But,
>>>>> airlift:aircompressor is the only implementation of LZO in pure java. That
>>>>> straight away solves #5.
>>>>> >> The other implementations that I found either have licensing issues
>>>>> (since LZO natively uses GNU GPL licence) or are implemented using .c, .h
>>>>> and jni (which again make them dependent on the OS). Please refer these:
>>>>> twitter/hadoop-lzo and shevek/lzo-java.
>>>>> >> These were the main reasons why we based this on
>>>>> airlift:aircompressor.
>>>>> >>
>>>>> >> Thanks and Regards,
>>>>> >> Amogh
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >> On Tue, Dec 3, 2019 at 2:59 AM Luke Cwik <lc...@google.com> wrote:
>>>>> >>>
>>>>> >>> I took a look. My biggest concern is finding a good LZO
>>>>> implementation. Looking for one that preferably has:
>>>>> >>> 1) Apache license
>>>>> >>> 2) Has zero transitive dependencies
>>>>> >>> 3) Is small
>>>>> >>> 4) Is performant
>>>>> >>> 5) Is native java or supports execution on the three main OSs
>>>>> (Windows, Linux, Mac)
>>>>> >>>
>>>>> >>> In your PR you suggested using io.airlift:aircompressor:0.16 which
>>>>> doesn't meet item #2 and its transitive dependency fails #3.
>>>>> >>>
>>>>> >>> On Mon, Dec 2, 2019 at 12:16 PM Amogh Tiwari <amo...@google.com>
>>>>> wrote:
>>>>> >>>>
>>>>> >>>> Hi,
>>>>> >>>> I have filed a PR for an extension that will enable Apache Beam
>>>>> to work with LZO/LZOP compression. Please refer.
>>>>> >>>> I would love it if someone can take this up and review it.
>>>>> >>>> Please feel free to share your thoughts/suggestions.
>>>>> >>>> Regards,
>>>>> >>>> Amogh
>>>>>
>>>>

Reply via email to