While studying the code, we found that the airlift/aircompressor library
only requires a few classes, and these are also present in the Apache Hadoop
common package. We are therefore considering modifying the
airlift/aircompressor package to remove the com.facebook.presto.hadoop
dependency and use the existing org.apache.hadoop
<https://mvnrepository.com/artifact/org.apache.hadoop> package, which is
already included in Beam. This would address both #2 and #3, since the
transitive dependency would be removed and the size would be reduced by
roughly 20 MB.

The drawback of this approach is that we would have to manually update the
util whenever changes are made to the airlift library.
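
For comparison, here is a rough sketch of how the Registrar/ServiceLoader
route suggested in the quoted thread below might look. This is only an
illustration under assumptions: CompressionProvider, AirliftLzoProvider and
CompressionRegistrarSketch are hypothetical names, not existing Beam APIs.
Only java.util.ServiceLoader is a real JDK class. A real provider jar would
also ship a META-INF/services entry naming its implementation; the sketch has
none, so it registers one provider by hand.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.ServiceLoader;

public class CompressionRegistrarSketch {

  /** Hypothetical SPI that each optional compression module would implement. */
  public interface CompressionProvider {
    String algorithm();          // e.g. "LZO"
    String implementationName(); // e.g. "aircompressor"
  }

  /** Hypothetical provider that would live in a separate, optional LZO module. */
  public static class AirliftLzoProvider implements CompressionProvider {
    @Override
    public String algorithm() {
      return "LZO";
    }

    @Override
    public String implementationName() {
      return "aircompressor";
    }
  }

  /** Collects every provider that ServiceLoader can find on the classpath. */
  public static List<CompressionProvider> discover() {
    List<CompressionProvider> found = new ArrayList<>();
    for (CompressionProvider provider : ServiceLoader.load(CompressionProvider.class)) {
      found.add(provider);
    }
    return found;
  }

  /** Returns the first provider registered for the given algorithm, if any. */
  public static Optional<CompressionProvider> find(
      List<CompressionProvider> providers, String algorithm) {
    return providers.stream()
        .filter(p -> p.algorithm().equalsIgnoreCase(algorithm))
        .findFirst();
  }

  public static void main(String[] args) {
    List<CompressionProvider> providers = discover();
    // No META-INF/services file exists in this sketch, so add one manually.
    providers.add(new AirliftLzoProvider());
    String name = find(providers, "lzo")
        .map(CompressionProvider::implementationName)
        .orElse("none");
    System.out.println("LZO provider: " + name);
  }
}
```

With this shape, the core SDK would depend only on the small interface, and
users who do not mind the Hadoop classes could add the LZO module themselves.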

On Wed, Dec 4, 2019 at 10:13 PM Luke Cwik <[email protected]> wrote:

> Going with the Registrar/ServiceLoader route would allow for alternative
> providers for the same compression algorithms so if they don't like one
> they can always contribute a different one.
>
> On Wed, Dec 4, 2019 at 8:22 AM Ismaël Mejía <[email protected]> wrote:
>
>> (1) seems not to be the issue because it is Apache licensed.
>> (2) and (3) are the big issues, because it requires a provided huge uber
>> jar that essentially leaks Hadoop classes into core SDK [1] so it is
>> definitely concerning.
>>
>> We discussed at some point during the PR that added ZStandard support
>> about creating some sort of Registrar for compression algorithms [2] but we
>> decided to not go ahead because we could achieve that for the zstd case via
>> the optional dependencies of commons-compress. Maybe it is time to
>> reconsider whether such a mechanism is worthwhile, for example for users
>> who do not mind the Hadoop leakage and want to be able to use LZO.
>>
>> Refs.
>> [1] https://mvnrepository.com/artifact/io.airlift/aircompressor/0.16
>> [2] https://issues.apache.org/jira/browse/BEAM-6422
>>
>>
>>
>>
>> On Tue, Dec 3, 2019 at 7:01 PM Robert Bradshaw <[email protected]>
>> wrote:
>>
>>> Is there a way to wrap this up as an optional dependency with multiple
>>> possible providers, if there's no good library satisfying all of the
>>> conditions (in particular (1))?
>>>
>>> On Tue, Dec 3, 2019 at 9:47 AM Luke Cwik <[email protected]> wrote:
>>> >
>>> > I was hoping that someone in the community would provide some
>>> alternatives since there are quite a few implementations.
>>> >
>>> > On Tue, Dec 3, 2019 at 8:20 AM Amogh Tiwari <[email protected]> wrote:
>>> >>
>>> >> Hi Luke,
>>> >>
>>> >> I agree with your thoughts and observations. But,
>>> airlift:aircompressor is the only implementation of LZO in pure java. That
>>> straight away solves #5.
>>> >> The other implementations that I found either have licensing issues
>>> (since LZO natively uses the GNU GPL licence) or are implemented using .c, .h
>>> and JNI (which again makes them dependent on the OS). Please refer to these:
>>> twitter/hadoop-lzo and shevek/lzo-java.
>>> >> These were the main reasons why we based this on
>>> airlift:aircompressor.
>>> >>
>>> >> Thanks and Regards,
>>> >> Amogh
>>> >>
>>> >>
>>> >>
>>> >> On Tue, Dec 3, 2019 at 2:59 AM Luke Cwik <[email protected]> wrote:
>>> >>>
>>> >>> I took a look. My biggest concern is finding a good LZO
>>> implementation. Looking for one that preferably has:
>>> >>> 1) Apache license
>>> >>> 2) Has zero transitive dependencies
>>> >>> 3) Is small
>>> >>> 4) Is performant
>>> >>> 5) Is native java or supports execution on the three main OSs
>>> (Windows, Linux, Mac)
>>> >>>
>>> >>> In your PR you suggested using io.airlift:aircompressor:0.16 which
>>> doesn't meet item #2 and its transitive dependency fails #3.
>>> >>>
>>> >>> On Mon, Dec 2, 2019 at 12:16 PM Amogh Tiwari <[email protected]>
>>> wrote:
>>> >>>>
>>> >>>> Hi,
>>> >>>> I have filed a PR for an extension that will enable Apache Beam to
>>> work with LZO/LZOP compression. Please refer.
>>> >>>> I would love it if someone can take this up and review it.
>>> >>>> Please feel free to share your thoughts/suggestions.
>>> >>>> Regards,
>>> >>>> Amogh
>>>
>>
