Sorry for the long delay (was on vacation).

Using org.apache.hadoop isn't part of the Apache Beam Core module but is a
dependency for those who depend on the Apache Beam Hadoop module. So I
don't think swapping the com.facebook.presto.hadoop version for the
org.apache.hadoop version will address Ismael's concern about including
hadoop dependencies as part of the core.

I looked at shevek/lzo-java[1] and I think its our best choice since it is:
* pure Java
* GPLv3 (would require marking the dependency optional and telling users to
add it explicitly which we have done in the past as well)
* small (<100kb)
* small dependency tree (commons-logging & findbugs annotations if we only
depend on lzo-core)
* performance (github page claims 500mb/s compression and 800mb/s
decompression)

Alternatively we can make the LZO compression an extension module (with the
facebook dependency) and use a registrar to have it loaded dynamically.

1: https://github.com/shevek/lzo-java

On Fri, Dec 6, 2019 at 5:09 AM Amogh Tiwari <[email protected]> wrote:

> While studying the code, we found that the airlift/ aircompressor library
> only requires some classes which are also present in apache hadoop common
> package. Therefore, we are now thinking that if we make changes in the
> airlift/ aircompressor package, remove the
> com.facebook.presto.hadoop and use the existing org.apache.hadoop
> <https://mvnrepository.com/artifact/org.apache.hadoop> package which is
> already included in beam. This will solve both #2 and #3 as the transitive
> dependency will be removed and the size will also be reduced by almost
> ~20mbs.
>
> But if we use this approach, we will have to manually change the util
> whenever any changes are made to the airlift library.
>
> On Wed, Dec 4, 2019 at 10:13 PM Luke Cwik <[email protected]> wrote:
>
>> Going with the Registrar/ServiceLoader route would allow for alternative
>> providers for the same compression algorithms so if they don't like one
>> they can always contribute a different one.
>>
>> On Wed, Dec 4, 2019 at 8:22 AM Ismaël Mejía <[email protected]> wrote:
>>
>>> (1) seems not to be the issue because it is Apache licensed.
>>> (2) and (3) are the big issues, because it requires a provided huge uber
>>> jar that essentially leaks Hadoop classes into core SDK [1] so it is
>>> definitely concerning.
>>>
>>> We discussed at some point during the PR that added ZStandard support
>>> about creating some sort of Registrar for compression algorithms [2] but we
>>> decided to not go ahead because we could achieve that for the zstd case via
>>> the optional dependencies of commons-compress. Maybe it is time to
>>> reconsider if such mechanism is worth. For example for users that may not
>>> care about having the hadoop leakage to be able to use LZO.
>>>
>>> Refs.
>>> [1] https://mvnrepository.com/artifact/io.airlift/aircompressor/0.16
>>> [2] https://issues.apache.org/jira/browse/BEAM-6422
>>>
>>>
>>>
>>>
>>> On Tue, Dec 3, 2019 at 7:01 PM Robert Bradshaw <[email protected]>
>>> wrote:
>>>
>>>> Is there a way to wrap this up as an optional dependency with multiple
>>>> possible providers, if there's no good library satisfying all of the
>>>> conditions (in particular (1))?
>>>>
>>>> On Tue, Dec 3, 2019 at 9:47 AM Luke Cwik <[email protected]> wrote:
>>>> >
>>>> > I was hoping that someone in the community would provide some
>>>> alternatives since there are quite a few implementations.
>>>> >
>>>> > On Tue, Dec 3, 2019 at 8:20 AM Amogh Tiwari <[email protected]>
>>>> wrote:
>>>> >>
>>>> >> Hi Luke,
>>>> >>
>>>> >> I agree with your thoughts and observations. But,
>>>> airlift:aircompressor is the only implementation of LZO in pure java. That
>>>> straight away solves #5.
>>>> >> The other implementations that I found either have licensing issues
>>>> (since LZO natively uses GNU GPL licence) or are implemented using .c, .h
>>>> and jni (which again make them dependent on the OS). Please refer these:
>>>> twitter/hadoop-lzo and shevek/lzo-java.
>>>> >> These were the main reasons why we based this on
>>>> airlift:aircompressor.
>>>> >>
>>>> >> Thanks and Regards,
>>>> >> Amogh
>>>> >>
>>>> >>
>>>> >>
>>>> >> On Tue, Dec 3, 2019 at 2:59 AM Luke Cwik <[email protected]> wrote:
>>>> >>>
>>>> >>> I took a look. My biggest concern is finding a good LZO
>>>> implementation. Looking for one that preferably has:
>>>> >>> 1) Apache license
>>>> >>> 2) Has zero transitive dependencies
>>>> >>> 3) Is small
>>>> >>> 4) Is performant
>>>> >>> 5) Is native java or supports execution on the three main OSs
>>>> (Windows, Linux, Mac)
>>>> >>>
>>>> >>> In your PR you suggested using io.airlift:aircompressor:0.16 which
>>>> doesn't meet item #2 and its transitive dependency fails #3.
>>>> >>>
>>>> >>> On Mon, Dec 2, 2019 at 12:16 PM Amogh Tiwari <[email protected]>
>>>> wrote:
>>>> >>>>
>>>> >>>> Hi,
>>>> >>>> I have filed a PR for an extension that will enable Apache Beam to
>>>> work with LZO/LZOP compression. Please refer.
>>>> >>>> I would love it if someone can take this up and review it.
>>>> >>>> Please feel free to share your thoughts/suggestions.
>>>> >>>> Regards,
>>>> >>>> Amogh
>>>>
>>>

Reply via email to