Thanks for bringing that back.

Indeed guava is shaded in different uber-jar. Maybe we can have a common deps module that we include once (but the user will have to explicitly define the dep) ?

Basically, what do you propose for protobuf (unfortunately, I don't see an obvious) ?

Regards
JB

On 12/11/2017 05:35 PM, Ismaël Mejía wrote:
Hello, I wanted to bring back this subject because I think we should
take action on this and at least first have a shaded version of guava.
I was playing with a toy project and I did the procedure we use to
submit jars to a Hadoop cluster via Flink/Spark which involves
creating an uber jar and I realized that the size of the jar was way
bigger than I expected, and the fact that we shade guava in every
module contributes to this. I found guava shaded on:

sdks/java/core
runners/core-construction-java
runners/core-java
model/job-management
runners/spark
sdks/java/io/hadoop-file-system
sdks/java/io/kafka

This means at least 6 times more of the size it should which counts in
around 15MB more (2.4MB*6 deps) of extra weight that we can simply
reduce by using a shaded version of the library.

I add this point to the previous ones mentioned during the discussion
because this goes against the end user experience and affects us all
(devs/users).

Another question is if we should shade (and how) protocol buffers
because now with the portability work we are exposing it closer to the
end users. I say this because I also found an issue while running a
job on YARN with the spark runner because hadoop-common includes
protobuf-java 2 and I had to explicitly provide protocol-buffers 3 as
a dependency to be able to use triggers (note the Spark runner
translates them using some method from runners/core-java). Since
hadoop-common is provided in the cluster with the older version of
protobuf, I am afraid that this will bite us in the future.

Ismaël

ps. There is already a JIRA for that shading for protobuf on
hadoop-common but this is not coming until version 3 is out.
https://issues.apache.org/jira/browse/HADOOP-13136

ps2. Extra curious situation is to see that the dataflow-runner ends
up having guava shaded twice via its shaded version on
core-construction-java.

ps3. Of course this message means a de-facto +1 at least to do it for
guava and evaluate it for other libs.


On Tue, Oct 17, 2017 at 7:29 PM, Lukasz Cwik <lc...@google.com.invalid> wrote:
An issue to call out is how to deal with our generated code (.avro and
.proto) as I don't believe those plugins allow you to generate code using a
shaded package prefix on imports.

On Tue, Oct 17, 2017 at 10:28 AM, Thomas Groh <tg...@google.com.invalid>
wrote:

+1 to the goal. I'm hugely in favor of not doing the same shading work
every time for dependencies we know we'll use.

This also means that if we end up pulling in transitive dependencies we
don't want in any particular module we can avoid having to adjust our
repackaging strategy for that module - which I have run into face-first in
the past.

On Tue, Oct 17, 2017 at 9:48 AM, Kenneth Knowles <k...@google.com.invalid>
wrote:

Hi all,

Shading is a big part of how we keep our dependencies sane in Beam. But
downsides: shading is super slow, causes massive jar bloat, and kind of
hard to get right because artifacts and namespaces are not 1-to-1.

I know that some communities distribute their own shaded distributions of
dependencies. I had a thought about doing something similar that I wanted
to throw out there for people to poke holes in.

To set the scene, here is how I view shading:

  - A module has public dependencies and private dependencies.
  - Public deps are used for data interchange; users must share these
deps.
  - Private deps are just functionality and can be hidden (in our case,
relocated + bundled)
  - It isn't necessarily that simple, because public and private deps
might
interact in higher-order ways ("public" is contagious)

Shading is an implementation detail of expressing these characteristics.
We
use shading selectively because of its downsides I mentioned above.

But what about this idea: Introduce shaded deps as a single separate
artifact.

  - sdks/java/private-deps: bundled uber jar with relocated versions of
everything we want to shade

  - sdks/java/core and sdks/java/harness: no relocation or bundling -
depends on `beam-sdks-java-private-deps` and imports like
`org.apache.beam.sdk.private.com.google.common` directly (this is what
they
are rewritten to

Some benefits

  - much faster builds of other modules
  - only one shaded uber jar
  - rare/no rebuilds of the uber jar
  - can use maven enforcer to forbid imports like com.google.common
  - configuration all in one place
  - no automated rewriting of our real code, which has led to some major
confusion
  - easy to implement incrementally

Downsides:

  - plenty of effort work to get there
  - unclear how many different such deps modules we need; sharing them
could
get weird
  - if we hit a roadblock, we will have committed a lot of time

Just something I was musing as I spent another evening waiting for slow
builds to try to confirm changes to brittle poms.

Kenn



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

Reply via email to