Hello, I wanted to revive this thread because I think we should act
on it and, as a first step, at least have a single shaded version of
guava. While playing with a toy project I followed the procedure we
use to submit jars to a Hadoop cluster via Flink/Spark, which involves
creating an uber jar, and I realized that the jar was far bigger than
I expected. The fact that we shade guava in every module contributes
to this. I found guava shaded in:

sdks/java/core
runners/core-construction-java
runners/core-java
model/job-management
runners/spark
sdks/java/io/hadoop-file-system
sdks/java/io/kafka

This means guava is bundled at least 6 times more than it should be,
which adds roughly 14-15MB (2.4MB * 6 extra copies) of weight that we
can remove simply by using a single shaded version of the library.
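
A minimal sketch of how one could count the guava copies inside an
uber jar. It assumes every copy, relocated or not, still contains
com/google/common/base/Preconditions.class, and it treats each distinct
path prefix in front of that entry as one copy. The class name
GuavaCopyFinder and the jar path argument are hypothetical, not
something that exists in Beam.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;
import java.util.jar.JarFile;
import java.util.stream.Collectors;

final class GuavaCopyFinder {
  // Assumption: this class is present in every guava copy, relocated or not.
  private static final String MARKER = "com/google/common/base/Preconditions.class";

  /** Returns the distinct, sorted path prefixes under which the marker class appears. */
  static List<String> guavaPrefixes(Iterable<String> entryNames) {
    TreeSet<String> prefixes = new TreeSet<>();
    for (String name : entryNames) {
      if (!name.endsWith(MARKER)) {
        continue;
      }
      String prefix = name.substring(0, name.length() - MARKER.length());
      // Accept only the original location ("") or a prefix ending at a package boundary.
      if (prefix.isEmpty() || prefix.endsWith("/")) {
        prefixes.add(prefix);
      }
    }
    return new ArrayList<>(prefixes);
  }

  public static void main(String[] args) throws IOException {
    if (args.length != 1) {
      System.err.println("usage: GuavaCopyFinder <path-to-uber-jar>");
      return;
    }
    try (JarFile jar = new JarFile(args[0])) {
      List<String> names = jar.stream().map(e -> e.getName()).collect(Collectors.toList());
      List<String> prefixes = guavaPrefixes(names);
      System.out.println(prefixes.size() + " guava copies found");
      for (String p : prefixes) {
        System.out.println("  " + (p.isEmpty() ? "(not relocated)" : p));
      }
    }
  }
}
```

Multiplying the number of extra copies by the size of one guava jar
gives a rough estimate of the avoidable weight.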

I add this point to the ones already mentioned during the discussion
because it hurts the end-user experience and affects us all
(devs/users).

Another question is whether (and how) we should shade protocol
buffers, because with the portability work we are now exposing them
more directly to end users. I raise this because I also hit an issue
while running a job on YARN with the Spark runner: hadoop-common
includes protobuf-java 2, and I had to explicitly provide
protobuf-java 3 as a dependency to be able to use triggers (note that
the Spark runner translates them using a method from
runners/core-java). Since hadoop-common is provided in the cluster
with the older version of protobuf, I am afraid that this will bite us
in the future.
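
A minimal sketch for diagnosing this kind of conflict: it prints the
jar from which a class was actually loaded, which shows whether the
cluster's copy or the job's copy won. The class name ClassOrigin and
the placeholder strings are hypothetical. com.google.protobuf.Message
is the protobuf-java interface used here as the probe.

```java
import java.security.CodeSource;

final class ClassOrigin {
  /** Returns where a class was loaded from, or a placeholder if the JVM reports no location. */
  static String originOf(Class<?> clazz) {
    CodeSource source = clazz.getProtectionDomain().getCodeSource();
    if (source == null || source.getLocation() == null) {
      // Typically the case for classes loaded by the bootstrap class loader.
      return "(bootstrap or unknown)";
    }
    return source.getLocation().toString();
  }

  /** Same as above, but looks the class up by name without initializing it. */
  static String originOf(String className) {
    try {
      return originOf(Class.forName(className, false, ClassOrigin.class.getClassLoader()));
    } catch (ClassNotFoundException e) {
      return "(not on classpath)";
    }
  }

  public static void main(String[] args) {
    System.out.println(originOf("com.google.protobuf.Message"));
  }
}
```

Running this inside the job (for example from the driver) rather than
locally is what matters, since the classpath provided by the cluster
is the one that decides which protobuf version is picked up.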

Ismaël

ps. There is already a JIRA for shading protobuf in hadoop-common,
but it will not land until version 3 is out.
https://issues.apache.org/jira/browse/HADOOP-13136

ps2. Another curious situation: the dataflow-runner ends up with guava
shaded twice because of the shaded copy in core-construction-java.

ps3. Of course, this message amounts to a de facto +1, at least to
doing this for guava and evaluating it for other libs.


On Tue, Oct 17, 2017 at 7:29 PM, Lukasz Cwik <lc...@google.com.invalid> wrote:
> An issue to call out is how to deal with our generated code (.avro and
> .proto) as I don't believe those plugins allow you to generate code using a
> shaded package prefix on imports.
>
> On Tue, Oct 17, 2017 at 10:28 AM, Thomas Groh <tg...@google.com.invalid>
> wrote:
>
>> +1 to the goal. I'm hugely in favor of not doing the same shading work
>> every time for dependencies we know we'll use.
>>
>> This also means that if we end up pulling in transitive dependencies we
>> don't want in any particular module we can avoid having to adjust our
>> repackaging strategy for that module - which I have run into face-first in
>> the past.
>>
>> On Tue, Oct 17, 2017 at 9:48 AM, Kenneth Knowles <k...@google.com.invalid>
>> wrote:
>>
>> > Hi all,
>> >
>> > Shading is a big part of how we keep our dependencies sane in Beam. But
>> > it has downsides: shading is super slow, causes massive jar bloat, and is
>> > kind of hard to get right because artifacts and namespaces are not 1-to-1.
>> >
>> > I know that some communities distribute their own shaded distributions of
>> > dependencies. I had a thought about doing something similar that I wanted
>> > to throw out there for people to poke holes in.
>> >
>> > To set the scene, here is how I view shading:
>> >
>> >  - A module has public dependencies and private dependencies.
>> >  - Public deps are used for data interchange; users must share these deps.
>> >  - Private deps are just functionality and can be hidden (in our case,
>> > relocated + bundled)
>> >  - It isn't necessarily that simple, because public and private deps might
>> > interact in higher-order ways ("public" is contagious)
>> >
>> > Shading is an implementation detail of expressing these characteristics. We
>> > use shading selectively because of its downsides I mentioned above.
>> >
>> > But what about this idea: Introduce shaded deps as a single separate
>> > artifact.
>> >
>> >  - sdks/java/private-deps: bundled uber jar with relocated versions of
>> > everything we want to shade
>> >
>> >  - sdks/java/core and sdks/java/harness: no relocation or bundling -
>> > depends on `beam-sdks-java-private-deps` and imports like
>> > `org.apache.beam.sdk.private.com.google.common` directly (this is what they
>> > are rewritten to)
>> >
>> > Some benefits
>> >
>> >  - much faster builds of other modules
>> >  - only one shaded uber jar
>> >  - rare/no rebuilds of the uber jar
>> >  - can use maven enforcer to forbid imports like com.google.common
>> >  - configuration all in one place
>> >  - no automated rewriting of our real code, which has led to some major
>> > confusion
>> >  - easy to implement incrementally
>> >
>> > Downsides:
>> >
>> >  - plenty of work to get there
>> >  - unclear how many different such deps modules we need; sharing them could
>> > get weird
>> >  - if we hit a roadblock, we will have committed a lot of time
>> >
>> > Just something I was musing as I spent another evening waiting for slow
>> > builds to try to confirm changes to brittle poms.
>> >
>> > Kenn
>> >
>>
