I would suggest that either we use: 1) A common deps package containing shaded dependencies allows for Pros * doesn't require the user to build an uber jar Risks * dependencies package will keep growing even if something is or isn't needed by all of Apache Beam leading to a large jar anyways negating any space savings
2) Shade within each module to a common location like org.apache.beam.relocated.guava.... Pros * you only get the shaded dependencies of the things that are required * its one less dependency for users to manage Risks * requires an uber jar to be built to get the space savings (either by a user or a distribution of Apache Beam) otherwise we negate any space savings. If we either use a common relocation scheme or a dependencies jar then each relocation should specifically contain the version number of the package because we would like to allow for us to be using two different versions of the same library. For the common deps package approach, should we check in code where the imports contain the relocated location (e.g. import org.apache.beam.guava.20.0.com.google.common.collect.ImmutableList)? On Mon, Dec 11, 2017 at 8:47 AM, Jean-Baptiste Onofré <[email protected]> wrote: > Thanks for bringing that back. > > Indeed guava is shaded in different uber-jar. Maybe we can have a common > deps module that we include once (but the user will have to explicitly > define the dep) ? > > Basically, what do you propose for protobuf (unfortunately, I don't see an > obvious) ? > > Regards > JB > > > On 12/11/2017 05:35 PM, Ismaël Mejía wrote: > >> Hello, I wanted to bring back this subject because I think we should >> take action on this and at least first have a shaded version of guava. >> I was playing with a toy project and I did the procedure we use to >> submit jars to a Hadoop cluster via Flink/Spark which involves >> creating an uber jar and I realized that the size of the jar was way >> bigger than I expected, and the fact that we shade guava in every >> module contributes to this. I found guava shaded on: >> >> sdks/java/core >> runners/core-construction-java >> runners/core-java >> model/job-management >> runners/spark >> sdks/java/io/hadoop-file-system >> sdks/java/io/kafka >> >> This means at least 6 times more of the size it should which counts in >> around 15MB more (2.4MB*6 deps) of extra weight that we can simply >> reduce by using a shaded version of the library. >> >> I add this point to the previous ones mentioned during the discussion >> because this goes against the end user experience and affects us all >> (devs/users). >> >> Another question is if we should shade (and how) protocol buffers >> because now with the portability work we are exposing it closer to the >> end users. I say this because I also found an issue while running a >> job on YARN with the spark runner because hadoop-common includes >> protobuf-java 2 and I had to explicitly provide protocol-buffers 3 as >> a dependency to be able to use triggers (note the Spark runner >> translates them using some method from runners/core-java). Since >> hadoop-common is provided in the cluster with the older version of >> protobuf, I am afraid that this will bite us in the future. >> >> Ismaël >> >> ps. There is already a JIRA for that shading for protobuf on >> hadoop-common but this is not coming until version 3 is out. >> https://issues.apache.org/jira/browse/HADOOP-13136 >> >> ps2. Extra curious situation is to see that the dataflow-runner ends >> up having guava shaded twice via its shaded version on >> core-construction-java. >> >> ps3. Of course this message means a de-facto +1 at least to do it for >> guava and evaluate it for other libs. >> >> >> On Tue, Oct 17, 2017 at 7:29 PM, Lukasz Cwik <[email protected]> >> wrote: >> >>> An issue to call out is how to deal with our generated code (.avro and >>> .proto) as I don't believe those plugins allow you to generate code >>> using a >>> shaded package prefix on imports. >>> >>> On Tue, Oct 17, 2017 at 10:28 AM, Thomas Groh <[email protected]> >>> wrote: >>> >>> +1 to the goal. I'm hugely in favor of not doing the same shading work >>>> every time for dependencies we know we'll use. >>>> >>>> This also means that if we end up pulling in transitive dependencies we >>>> don't want in any particular module we can avoid having to adjust our >>>> repackaging strategy for that module - which I have run into face-first >>>> in >>>> the past. >>>> >>>> On Tue, Oct 17, 2017 at 9:48 AM, Kenneth Knowles <[email protected] >>>> > >>>> wrote: >>>> >>>> Hi all, >>>>> >>>>> Shading is a big part of how we keep our dependencies sane in Beam. But >>>>> downsides: shading is super slow, causes massive jar bloat, and kind of >>>>> hard to get right because artifacts and namespaces are not 1-to-1. >>>>> >>>>> I know that some communities distribute their own shaded distributions >>>>> of >>>>> dependencies. I had a thought about doing something similar that I >>>>> wanted >>>>> to throw out there for people to poke holes in. >>>>> >>>>> To set the scene, here is how I view shading: >>>>> >>>>> - A module has public dependencies and private dependencies. >>>>> - Public deps are used for data interchange; users must share these >>>>> >>>> deps. >>>> >>>>> - Private deps are just functionality and can be hidden (in our case, >>>>> relocated + bundled) >>>>> - It isn't necessarily that simple, because public and private deps >>>>> >>>> might >>>> >>>>> interact in higher-order ways ("public" is contagious) >>>>> >>>>> Shading is an implementation detail of expressing these >>>>> characteristics. >>>>> >>>> We >>>> >>>>> use shading selectively because of its downsides I mentioned above. >>>>> >>>>> But what about this idea: Introduce shaded deps as a single separate >>>>> artifact. >>>>> >>>>> - sdks/java/private-deps: bundled uber jar with relocated versions of >>>>> everything we want to shade >>>>> >>>>> - sdks/java/core and sdks/java/harness: no relocation or bundling - >>>>> depends on `beam-sdks-java-private-deps` and imports like >>>>> `org.apache.beam.sdk.private.com.google.common` directly (this is what >>>>> they >>>>> are rewritten to >>>>> >>>>> Some benefits >>>>> >>>>> - much faster builds of other modules >>>>> - only one shaded uber jar >>>>> - rare/no rebuilds of the uber jar >>>>> - can use maven enforcer to forbid imports like com.google.common >>>>> - configuration all in one place >>>>> - no automated rewriting of our real code, which has led to some >>>>> major >>>>> confusion >>>>> - easy to implement incrementally >>>>> >>>>> Downsides: >>>>> >>>>> - plenty of effort work to get there >>>>> - unclear how many different such deps modules we need; sharing them >>>>> >>>> could >>>> >>>>> get weird >>>>> - if we hit a roadblock, we will have committed a lot of time >>>>> >>>>> Just something I was musing as I spent another evening waiting for slow >>>>> builds to try to confirm changes to brittle poms. >>>>> >>>>> Kenn >>>>> >>>>> >>>> > -- > Jean-Baptiste Onofré > [email protected] > http://blog.nanthrax.net > Talend - http://www.talend.com >
