Hello, I wanted to bring back this subject because I think we should take action on this and at least first have a shaded version of guava. I was playing with a toy project and I did the procedure we use to submit jars to a Hadoop cluster via Flink/Spark which involves creating an uber jar and I realized that the size of the jar was way bigger than I expected, and the fact that we shade guava in every module contributes to this. I found guava shaded on:
sdks/java/core runners/core-construction-java runners/core-java model/job-management runners/spark sdks/java/io/hadoop-file-system sdks/java/io/kafka This means at least 6 times more of the size it should which counts in around 15MB more (2.4MB*6 deps) of extra weight that we can simply reduce by using a shaded version of the library. I add this point to the previous ones mentioned during the discussion because this goes against the end user experience and affects us all (devs/users). Another question is if we should shade (and how) protocol buffers because now with the portability work we are exposing it closer to the end users. I say this because I also found an issue while running a job on YARN with the spark runner because hadoop-common includes protobuf-java 2 and I had to explicitly provide protocol-buffers 3 as a dependency to be able to use triggers (note the Spark runner translates them using some method from runners/core-java). Since hadoop-common is provided in the cluster with the older version of protobuf, I am afraid that this will bite us in the future. Ismaël ps. There is already a JIRA for that shading for protobuf on hadoop-common but this is not coming until version 3 is out. https://issues.apache.org/jira/browse/HADOOP-13136 ps2. Extra curious situation is to see that the dataflow-runner ends up having guava shaded twice via its shaded version on core-construction-java. ps3. Of course this message means a de-facto +1 at least to do it for guava and evaluate it for other libs. On Tue, Oct 17, 2017 at 7:29 PM, Lukasz Cwik <lc...@google.com.invalid> wrote: > An issue to call out is how to deal with our generated code (.avro and > .proto) as I don't believe those plugins allow you to generate code using a > shaded package prefix on imports. > > On Tue, Oct 17, 2017 at 10:28 AM, Thomas Groh <tg...@google.com.invalid> > wrote: > >> +1 to the goal. I'm hugely in favor of not doing the same shading work >> every time for dependencies we know we'll use. >> >> This also means that if we end up pulling in transitive dependencies we >> don't want in any particular module we can avoid having to adjust our >> repackaging strategy for that module - which I have run into face-first in >> the past. >> >> On Tue, Oct 17, 2017 at 9:48 AM, Kenneth Knowles <k...@google.com.invalid> >> wrote: >> >> > Hi all, >> > >> > Shading is a big part of how we keep our dependencies sane in Beam. But >> > downsides: shading is super slow, causes massive jar bloat, and kind of >> > hard to get right because artifacts and namespaces are not 1-to-1. >> > >> > I know that some communities distribute their own shaded distributions of >> > dependencies. I had a thought about doing something similar that I wanted >> > to throw out there for people to poke holes in. >> > >> > To set the scene, here is how I view shading: >> > >> > - A module has public dependencies and private dependencies. >> > - Public deps are used for data interchange; users must share these >> deps. >> > - Private deps are just functionality and can be hidden (in our case, >> > relocated + bundled) >> > - It isn't necessarily that simple, because public and private deps >> might >> > interact in higher-order ways ("public" is contagious) >> > >> > Shading is an implementation detail of expressing these characteristics. >> We >> > use shading selectively because of its downsides I mentioned above. >> > >> > But what about this idea: Introduce shaded deps as a single separate >> > artifact. >> > >> > - sdks/java/private-deps: bundled uber jar with relocated versions of >> > everything we want to shade >> > >> > - sdks/java/core and sdks/java/harness: no relocation or bundling - >> > depends on `beam-sdks-java-private-deps` and imports like >> > `org.apache.beam.sdk.private.com.google.common` directly (this is what >> > they >> > are rewritten to >> > >> > Some benefits >> > >> > - much faster builds of other modules >> > - only one shaded uber jar >> > - rare/no rebuilds of the uber jar >> > - can use maven enforcer to forbid imports like com.google.common >> > - configuration all in one place >> > - no automated rewriting of our real code, which has led to some major >> > confusion >> > - easy to implement incrementally >> > >> > Downsides: >> > >> > - plenty of effort work to get there >> > - unclear how many different such deps modules we need; sharing them >> could >> > get weird >> > - if we hit a roadblock, we will have committed a lot of time >> > >> > Just something I was musing as I spent another evening waiting for slow >> > builds to try to confirm changes to brittle poms. >> > >> > Kenn >> > >>