Huge +1 to get rid of Guava!

This solves annoying dependency issues for some IOs and allow us to
get rid of the shading that makes current jars bigger than they should
be.

We can create our own 'micro guava' package with some classes for
things that are hard to migrate, or that we  prefer to still have like
the check* methods for example. Given the size of the task we should
probably divide it into subtasks, more important is to get rid of it
for 'sdks/java/core'. We can then attack other areas afterwards.

Other important idea would be to get rid of Protobuf in public APIs
like GCPIO and to better shade it from leaking into the runners. An
unexpected side effect of this is a leak of netty via gRPC/protobuf
that is byting us for the Spark runner, but well that's worth a
different discussion.


On Thu, Feb 1, 2018 at 10:08 AM, Romain Manni-Bucau
<[email protected]> wrote:
> a map of list is fine and not a challenge we'll face long I hope ;)
>
>
> Romain Manni-Bucau
> @rmannibucau |  Blog | Old Blog | Github | LinkedIn
>
> 2018-02-01 9:40 GMT+01:00 Reuven Lax <[email protected]>:
>>
>> Not sure we'll be able to replace them all. Things like guava Table and
>> Multimap don't have great replacements in Java8.
>>
>> On Wed, Jan 31, 2018 at 10:11 PM, Jean-Baptiste Onofré <[email protected]>
>> wrote:
>>>
>>> +1, it was on my TODO for a while waiting the Java8 update.
>>>
>>> Regards
>>> JB
>>>
>>> On 02/01/2018 06:56 AM, Romain Manni-Bucau wrote:
>>> > Why not dropping guava for all beam codebase? With java 8 it is quite
>>> > easy to do
>>> > it and avoid a bunch of conflicts. Did it in 2 projects with quite a
>>> > good result.
>>> >
>>> > Le 1 févr. 2018 06:50, "Lukasz Cwik" <[email protected]
>>> > <mailto:[email protected]>> a écrit :
>>> >
>>> >     Make sure to include the guava version in the artifact name so that
>>> > we can
>>> >     have multiple vendored versions.
>>> >
>>> >     On Wed, Jan 31, 2018 at 9:16 PM, Kenneth Knowles <[email protected]
>>> >     <mailto:[email protected]>> wrote:
>>> >
>>> >         I didn't have time for this, but it just bit me. We definitely
>>> > have
>>> >         Guava on the API surface of runner support code in ways that
>>> > get
>>> >         incompatibly shaded. I will probably start "1a" by making a
>>> > shaded
>>> >         library org.apache.beam:vendored-guava and starting to use it.
>>> > It sounds
>>> >         like there is generally unanimous support for that much,
>>> > anyhow.
>>> >
>>> >         Kenn
>>> >
>>> >         On Wed, Dec 13, 2017 at 8:31 AM, Aljoscha Krettek
>>> > <[email protected]
>>> >         <mailto:[email protected]>> wrote:
>>> >
>>> >             Thanks Ismaël for bringing up this discussion again!
>>> >
>>> >             I would be in favour of 1) and more specifically of 1a)
>>> >
>>> >             Aljoscha
>>> >
>>> >
>>> >>             On 12. Dec 2017, at 18:56, Lukasz Cwik <[email protected]
>>> >>             <mailto:[email protected]>> wrote:
>>> >>
>>> >>             You can always run tests on post shaded artifacts instead
>>> >> of the
>>> >>             compiled classes, it just requires us to change our maven
>>> >> surefire
>>> >>             / gradle test configurations but it is true that most IDEs
>>> >> would
>>> >>             behave better with a dependency jar unless you delegate
>>> >> all the
>>> >>             build/test actions to the build system and then it won't
>>> >> matter.
>>> >>
>>> >>             On Mon, Dec 11, 2017 at 9:05 PM, Kenneth Knowles
>>> >> <[email protected]
>>> >>             <mailto:[email protected]>> wrote:
>>> >>
>>> >>                 There's also, with additional overhead,
>>> >>
>>> >>                 1a) A relocated and shipped package for each thing we
>>> >> want to
>>> >>                 relocate. I think this has also been tried outside
>>> >> Beam...
>>> >>
>>> >>                 Pros:
>>> >>                 * all the pros of 1) plus no bloat beyond what is
>>> >> necessary
>>> >>                 Cons:
>>> >>                 * abandons whitelist approach for public deps,
>>> >> reverting to
>>> >>                 blacklist approach for trouble things like guava, so a
>>> >> bit
>>> >>                 less principled
>>> >>
>>> >>                 For both 1) and 1a) I would add:
>>> >>
>>> >>                 Pros:
>>> >>                 * clearly readable dependency since code will `import
>>> >>                 org.apache.beam.private.guava21` and IDEs will
>>> >> understand it
>>> >>                 is a distinct lilbrary
>>> >>                 * can run tests on unpackaged classes, as long as deps
>>> >> are
>>> >>                 shaded or provided as jars
>>> >>                 * no mysterious action at a distance from inherited
>>> >> configuration
>>> >>                 Cons:
>>> >>                 * need to adjust imports
>>> >>
>>> >>                 Kenn
>>> >>
>>> >>                 On Mon, Dec 11, 2017 at 9:57 AM, Lukasz Cwik
>>> >> <[email protected]
>>> >>                 <mailto:[email protected]>> wrote:
>>> >>
>>> >>                     I would suggest that either we use:
>>> >>                     1) A common deps package containing shaded
>>> >> dependencies
>>> >>                     allows for
>>> >>                     Pros
>>> >>                     * doesn't require the user to build an uber jar
>>> >>                     Risks
>>> >>                     * dependencies package will keep growing even if
>>> >> something
>>> >>                     is or isn't needed by all of Apache Beam leading
>>> >> to a
>>> >>                     large jar anyways negating any space savings
>>> >>
>>> >>                     2) Shade within each module to a common location
>>> >> like
>>> >>                     org.apache.beam.relocated.guava....
>>> >>                     Pros
>>> >>                     * you only get the shaded dependencies of the
>>> >> things that
>>> >>                     are required
>>> >>                     * its one less dependency for users to manage
>>> >>                     Risks
>>> >>                     * requires an uber jar to be built to get the
>>> >> space
>>> >>                     savings (either by a user or a distribution of
>>> >> Apache
>>> >>                     Beam) otherwise we negate any space savings.
>>> >>
>>> >>                     If we either use a common relocation scheme or a
>>> >>                     dependencies jar then each relocation should
>>> >> specifically
>>> >>                     contain the version number of the package because
>>> >> we would
>>> >>                     like to allow for us to be using two different
>>> >> versions of
>>> >>                     the same library.
>>> >>
>>> >>                     For the common deps package approach, should we
>>> >> check in
>>> >>                     code where the imports contain the relocated
>>> >> location
>>> >>                     (e.g. import org.apache.beam.guava.20.0.com
>>> >>
>>> >> <http://org.apache.beam.guava.20.0.com/>.google.common.collect.ImmutableList)?
>>> >>
>>> >>
>>> >>                     On Mon, Dec 11, 2017 at 8:47 AM, Jean-Baptiste
>>> >> Onofré
>>> >>                     <[email protected] <mailto:[email protected]>> wrote:
>>> >>
>>> >>                         Thanks for bringing that back.
>>> >>
>>> >>                         Indeed guava is shaded in different uber-jar.
>>> >> Maybe we
>>> >>                         can have a common deps module that we include
>>> >> once
>>> >>                         (but the user will have to explicitly define
>>> >> the dep) ?
>>> >>
>>> >>                         Basically, what do you propose for protobuf
>>> >>                         (unfortunately, I don't see an obvious) ?
>>> >>
>>> >>                         Regards
>>> >>                         JB
>>> >>
>>> >>
>>> >>                         On 12/11/2017 05:35 PM, Ismaël Mejía wrote:
>>> >>
>>> >>                             Hello, I wanted to bring back this subject
>>> >> because
>>> >>                             I think we should
>>> >>                             take action on this and at least first
>>> >> have a
>>> >>                             shaded version of guava.
>>> >>                             I was playing with a toy project and I did
>>> >> the
>>> >>                             procedure we use to
>>> >>                             submit jars to a Hadoop cluster via
>>> >> Flink/Spark
>>> >>                             which involves
>>> >>                             creating an uber jar and I realized that
>>> >> the size
>>> >>                             of the jar was way
>>> >>                             bigger than I expected, and the fact that
>>> >> we shade
>>> >>                             guava in every
>>> >>                             module contributes to this. I found guava
>>> >> shaded on:
>>> >>
>>> >>                             sdks/java/core
>>> >>                             runners/core-construction-java
>>> >>                             runners/core-java
>>> >>                             model/job-management
>>> >>                             runners/spark
>>> >>                             sdks/java/io/hadoop-file-system
>>> >>                             sdks/java/io/kafka
>>> >>
>>> >>                             This means at least 6 times more of the
>>> >> size it
>>> >>                             should which counts in
>>> >>                             around 15MB more (2.4MB*6 deps) of extra
>>> >> weight
>>> >>                             that we can simply
>>> >>                             reduce by using a shaded version of the
>>> >> library.
>>> >>
>>> >>                             I add this point to the previous ones
>>> >> mentioned
>>> >>                             during the discussion
>>> >>                             because this goes against the end user
>>> >> experience
>>> >>                             and affects us all
>>> >>                             (devs/users).
>>> >>
>>> >>                             Another question is if we should shade
>>> >> (and how)
>>> >>                             protocol buffers
>>> >>                             because now with the portability work we
>>> >> are
>>> >>                             exposing it closer to the
>>> >>                             end users. I say this because I also found
>>> >> an
>>> >>                             issue while running a
>>> >>                             job on YARN with the spark runner because
>>> >>                             hadoop-common includes
>>> >>                             protobuf-java 2 and I had to explicitly
>>> >> provide
>>> >>                             protocol-buffers 3 as
>>> >>                             a dependency to be able to use triggers
>>> >> (note the
>>> >>                             Spark runner
>>> >>                             translates them using some method from
>>> >>                             runners/core-java). Since
>>> >>                             hadoop-common is provided in the cluster
>>> >> with the
>>> >>                             older version of
>>> >>                             protobuf, I am afraid that this will bite
>>> >> us in
>>> >>                             the future.
>>> >>
>>> >>                             Ismaël
>>> >>
>>> >>                             ps. There is already a JIRA for that
>>> >> shading for
>>> >>                             protobuf on
>>> >>                             hadoop-common but this is not coming until
>>> >> version
>>> >>                             3 is out.
>>> >>
>>> >> https://issues.apache.org/jira/browse/HADOOP-13136
>>> >>
>>> >> <https://issues.apache.org/jira/browse/HADOOP-13136>
>>> >>
>>> >>                             ps2. Extra curious situation is to see
>>> >> that the
>>> >>                             dataflow-runner ends
>>> >>                             up having guava shaded twice via its
>>> >> shaded version on
>>> >>                             core-construction-java.
>>> >>
>>> >>                             ps3. Of course this message means a
>>> >> de-facto +1 at
>>> >>                             least to do it for
>>> >>                             guava and evaluate it for other libs.
>>> >>
>>> >>
>>> >>                             On Tue, Oct 17, 2017 at 7:29 PM, Lukasz
>>> >> Cwik
>>> >>                             <[email protected]
>>> >>                             <mailto:[email protected]>> wrote:
>>> >>
>>> >>                                 An issue to call out is how to deal
>>> >> with our
>>> >>                                 generated code (.avro and
>>> >>                                 .proto) as I don't believe those
>>> >> plugins allow
>>> >>                                 you to generate code using a
>>> >>                                 shaded package prefix on imports.
>>> >>
>>> >>                                 On Tue, Oct 17, 2017 at 10:28 AM,
>>> >> Thomas Groh
>>> >>                                 <[email protected]
>>> >>                                 <mailto:[email protected]>>
>>> >>                                 wrote:
>>> >>
>>> >>                                     +1 to the goal. I'm hugely in
>>> >> favor of not
>>> >>                                     doing the same shading work
>>> >>                                     every time for dependencies we
>>> >> know we'll use.
>>> >>
>>> >>                                     This also means that if we end up
>>> >> pulling
>>> >>                                     in transitive dependencies we
>>> >>                                     don't want in any particular
>>> >> module we can
>>> >>                                     avoid having to adjust our
>>> >>                                     repackaging strategy for that
>>> >> module -
>>> >>                                     which I have run into face-first
>>> >> in
>>> >>                                     the past.
>>> >>
>>> >>                                     On Tue, Oct 17, 2017 at 9:48 AM,
>>> >> Kenneth
>>> >>                                     Knowles <[email protected]
>>> >>                                     <mailto:[email protected]>>
>>> >>                                     wrote:
>>> >>
>>> >>                                         Hi all,
>>> >>
>>> >>                                         Shading is a big part of how
>>> >> we keep
>>> >>                                         our dependencies sane in Beam.
>>> >> But
>>> >>                                         downsides: shading is super
>>> >> slow,
>>> >>                                         causes massive jar bloat, and
>>> >> kind of
>>> >>                                         hard to get right because
>>> >> artifacts
>>> >>                                         and namespaces are not 1-to-1.
>>> >>
>>> >>                                         I know that some communities
>>> >>                                         distribute their own shaded
>>> >>                                         distributions of
>>> >>                                         dependencies. I had a thought
>>> >> about
>>> >>                                         doing something similar that I
>>> >> wanted
>>> >>                                         to throw out there for people
>>> >> to poke
>>> >>                                         holes in.
>>> >>
>>> >>                                         To set the scene, here is how
>>> >> I view
>>> >>                                         shading:
>>> >>
>>> >>                                           - A module has public
>>> >> dependencies
>>> >>                                         and private dependencies.
>>> >>                                           - Public deps are used for
>>> >> data
>>> >>                                         interchange; users must share
>>> >> these
>>> >>
>>> >>                                     deps.
>>> >>
>>> >>                                           - Private deps are just
>>> >>                                         functionality and can be
>>> >> hidden (in
>>> >>                                         our case,
>>> >>                                         relocated + bundled)
>>> >>                                           - It isn't necessarily that
>>> >> simple,
>>> >>                                         because public and private
>>> >> deps
>>> >>
>>> >>                                     might
>>> >>
>>> >>                                         interact in higher-order ways
>>> >>                                         ("public" is contagious)
>>> >>
>>> >>                                         Shading is an implementation
>>> >> detail of
>>> >>                                         expressing these
>>> >> characteristics.
>>> >>
>>> >>                                     We
>>> >>
>>> >>                                         use shading selectively
>>> >> because of its
>>> >>                                         downsides I mentioned above.
>>> >>
>>> >>                                         But what about this idea:
>>> >> Introduce
>>> >>                                         shaded deps as a single
>>> >> separate
>>> >>                                         artifact.
>>> >>
>>> >>                                           - sdks/java/private-deps:
>>> >> bundled
>>> >>                                         uber jar with relocated
>>> >> versions of
>>> >>                                         everything we want to shade
>>> >>
>>> >>                                           - sdks/java/core and
>>> >>                                         sdks/java/harness: no
>>> >> relocation or
>>> >>                                         bundling -
>>> >>                                         depends on
>>> >>                                         `beam-sdks-java-private-deps`
>>> >> and
>>> >>                                         imports like
>>> >>
>>> >> `org.apache.beam.sdk.private.com.google.common`
>>> >>                                         directly (this is what
>>> >>                                         they
>>> >>                                         are rewritten to
>>> >>
>>> >>                                         Some benefits
>>> >>
>>> >>                                           - much faster builds of
>>> >> other modules
>>> >>                                           - only one shaded uber jar
>>> >>                                           - rare/no rebuilds of the
>>> >> uber jar
>>> >>                                           - can use maven enforcer to
>>> >> forbid
>>> >>                                         imports like com.google.common
>>> >>                                           - configuration all in one
>>> >> place
>>> >>                                           - no automated rewriting of
>>> >> our real
>>> >>                                         code, which has led to some
>>> >> major
>>> >>                                         confusion
>>> >>                                           - easy to implement
>>> >> incrementally
>>> >>
>>> >>                                         Downsides:
>>> >>
>>> >>                                           - plenty of effort work to
>>> >> get there
>>> >>                                           - unclear how many different
>>> >> such
>>> >>                                         deps modules we need; sharing
>>> >> them
>>> >>
>>> >>                                     could
>>> >>
>>> >>                                         get weird
>>> >>                                           - if we hit a roadblock, we
>>> >> will
>>> >>                                         have committed a lot of time
>>> >>
>>> >>                                         Just something I was musing as
>>> >> I spent
>>> >>                                         another evening waiting for
>>> >> slow
>>> >>                                         builds to try to confirm
>>> >> changes to
>>> >>                                         brittle poms.
>>> >>
>>> >>                                         Kenn
>>> >>
>>> >>
>>> >>
>>> >>                         --
>>> >>                         Jean-Baptiste Onofré
>>> >>                         [email protected]
>>> >> <mailto:[email protected]>
>>> >>                         http://blog.nanthrax.net
>>> >> <http://blog.nanthrax.net/>
>>> >>                         Talend - http://www.talend.com
>>> >> <http://www.talend.com/>
>>> >>
>>> >>
>>> >>
>>> >>
>>> >
>>> >
>>> >
>>>
>>> --
>>> Jean-Baptiste Onofré
>>> [email protected]
>>> http://blog.nanthrax.net
>>> Talend - http://www.talend.com
>>
>>
>

Reply via email to