While studying the code, we found that the airlift/ aircompressor library only requires some classes which are also present in apache hadoop common package. Therefore, we are now thinking that if we make changes in the airlift/ aircompressor package, remove the com.facebook.presto.hadoop and use the existing org.apache.hadoop <https://mvnrepository.com/artifact/org.apache.hadoop> package which is already included in beam. This will solve both #2 and #3 as the transitive dependency will be removed and the size will also be reduced by almost ~20mbs.
But if we use this approach, we will have to manually change the util whenever any changes are made to the airlift library. On Wed, Dec 4, 2019 at 10:13 PM Luke Cwik <[email protected]> wrote: > Going with the Registrar/ServiceLoader route would allow for alternative > providers for the same compression algorithms so if they don't like one > they can always contribute a different one. > > On Wed, Dec 4, 2019 at 8:22 AM Ismaël Mejía <[email protected]> wrote: > >> (1) seems not to be the issue because it is Apache licensed. >> (2) and (3) are the big issues, because it requires a provided huge uber >> jar that essentially leaks Hadoop classes into core SDK [1] so it is >> definitely concerning. >> >> We discussed at some point during the PR that added ZStandard support >> about creating some sort of Registrar for compression algorithms [2] but we >> decided to not go ahead because we could achieve that for the zstd case via >> the optional dependencies of commons-compress. Maybe it is time to >> reconsider if such mechanism is worth. For example for users that may not >> care about having the hadoop leakage to be able to use LZO. >> >> Refs. >> [1] https://mvnrepository.com/artifact/io.airlift/aircompressor/0.16 >> [2] https://issues.apache.org/jira/browse/BEAM-6422 >> >> >> >> >> On Tue, Dec 3, 2019 at 7:01 PM Robert Bradshaw <[email protected]> >> wrote: >> >>> Is there a way to wrap this up as an optional dependency with multiple >>> possible providers, if there's no good library satisfying all of the >>> conditions (in particular (1))? >>> >>> On Tue, Dec 3, 2019 at 9:47 AM Luke Cwik <[email protected]> wrote: >>> > >>> > I was hoping that someone in the community would provide some >>> alternatives since there are quite a few implementations. >>> > >>> > On Tue, Dec 3, 2019 at 8:20 AM Amogh Tiwari <[email protected]> wrote: >>> >> >>> >> Hi Luke, >>> >> >>> >> I agree with your thoughts and observations. But, >>> airlift:aircompressor is the only implementation of LZO in pure java. That >>> straight away solves #5. >>> >> The other implementations that I found either have licensing issues >>> (since LZO natively uses GNU GPL licence) or are implemented using .c, .h >>> and jni (which again make them dependent on the OS). Please refer these: >>> twitter/hadoop-lzo and shevek/lzo-java. >>> >> These were the main reasons why we based this on >>> airlift:aircompressor. >>> >> >>> >> Thanks and Regards, >>> >> Amogh >>> >> >>> >> >>> >> >>> >> On Tue, Dec 3, 2019 at 2:59 AM Luke Cwik <[email protected]> wrote: >>> >>> >>> >>> I took a look. My biggest concern is finding a good LZO >>> implementation. Looking for one that preferably has: >>> >>> 1) Apache license >>> >>> 2) Has zero transitive dependencies >>> >>> 3) Is small >>> >>> 4) Is performant >>> >>> 5) Is native java or supports execution on the three main OSs >>> (Windows, Linux, Mac) >>> >>> >>> >>> In your PR you suggested using io.airlift:aircompressor:0.16 which >>> doesn't meet item #2 and its transitive dependency fails #3. >>> >>> >>> >>> On Mon, Dec 2, 2019 at 12:16 PM Amogh Tiwari <[email protected]> >>> wrote: >>> >>>> >>> >>>> Hi, >>> >>>> I have filed a PR for an extension that will enable Apache Beam to >>> work with LZO/LZOP compression. Please refer. >>> >>>> I would love it if someone can take this up and review it. >>> >>>> Please feel free to share your thoughts/suggestions. >>> >>>> Regards, >>> >>>> Amogh >>> >>
