There's also, with additional overhead, 1a) A relocated and shipped package for each thing we want to relocate. I think this has also been tried outside Beam...
Pros: * all the pros of 1) plus no bloat beyond what is necessary Cons: * abandons whitelist approach for public deps, reverting to blacklist approach for trouble things like guava, so a bit less principled For both 1) and 1a) I would add: Pros: * clearly readable dependency since code will `import org.apache.beam.private.guava21` and IDEs will understand it is a distinct lilbrary * can run tests on unpackaged classes, as long as deps are shaded or provided as jars * no mysterious action at a distance from inherited configuration Cons: * need to adjust imports Kenn On Mon, Dec 11, 2017 at 9:57 AM, Lukasz Cwik <[email protected]> wrote: > I would suggest that either we use: > 1) A common deps package containing shaded dependencies allows for > Pros > * doesn't require the user to build an uber jar > Risks > * dependencies package will keep growing even if something is or isn't > needed by all of Apache Beam leading to a large jar anyways negating any > space savings > > 2) Shade within each module to a common location like > org.apache.beam.relocated.guava.... > Pros > * you only get the shaded dependencies of the things that are required > * its one less dependency for users to manage > Risks > * requires an uber jar to be built to get the space savings (either by a > user or a distribution of Apache Beam) otherwise we negate any space > savings. > > If we either use a common relocation scheme or a dependencies jar then > each relocation should specifically contain the version number of the > package because we would like to allow for us to be using two different > versions of the same library. > > For the common deps package approach, should we check in code where the > imports contain the relocated location (e.g. import > org.apache.beam.guava.20.0.com.google.common.collect.ImmutableList)? > > > On Mon, Dec 11, 2017 at 8:47 AM, Jean-Baptiste Onofré <[email protected]> > wrote: > >> Thanks for bringing that back. >> >> Indeed guava is shaded in different uber-jar. Maybe we can have a common >> deps module that we include once (but the user will have to explicitly >> define the dep) ? >> >> Basically, what do you propose for protobuf (unfortunately, I don't see >> an obvious) ? >> >> Regards >> JB >> >> >> On 12/11/2017 05:35 PM, Ismaël Mejía wrote: >> >>> Hello, I wanted to bring back this subject because I think we should >>> take action on this and at least first have a shaded version of guava. >>> I was playing with a toy project and I did the procedure we use to >>> submit jars to a Hadoop cluster via Flink/Spark which involves >>> creating an uber jar and I realized that the size of the jar was way >>> bigger than I expected, and the fact that we shade guava in every >>> module contributes to this. I found guava shaded on: >>> >>> sdks/java/core >>> runners/core-construction-java >>> runners/core-java >>> model/job-management >>> runners/spark >>> sdks/java/io/hadoop-file-system >>> sdks/java/io/kafka >>> >>> This means at least 6 times more of the size it should which counts in >>> around 15MB more (2.4MB*6 deps) of extra weight that we can simply >>> reduce by using a shaded version of the library. >>> >>> I add this point to the previous ones mentioned during the discussion >>> because this goes against the end user experience and affects us all >>> (devs/users). >>> >>> Another question is if we should shade (and how) protocol buffers >>> because now with the portability work we are exposing it closer to the >>> end users. I say this because I also found an issue while running a >>> job on YARN with the spark runner because hadoop-common includes >>> protobuf-java 2 and I had to explicitly provide protocol-buffers 3 as >>> a dependency to be able to use triggers (note the Spark runner >>> translates them using some method from runners/core-java). Since >>> hadoop-common is provided in the cluster with the older version of >>> protobuf, I am afraid that this will bite us in the future. >>> >>> Ismaël >>> >>> ps. There is already a JIRA for that shading for protobuf on >>> hadoop-common but this is not coming until version 3 is out. >>> https://issues.apache.org/jira/browse/HADOOP-13136 >>> >>> ps2. Extra curious situation is to see that the dataflow-runner ends >>> up having guava shaded twice via its shaded version on >>> core-construction-java. >>> >>> ps3. Of course this message means a de-facto +1 at least to do it for >>> guava and evaluate it for other libs. >>> >>> >>> On Tue, Oct 17, 2017 at 7:29 PM, Lukasz Cwik <[email protected]> >>> wrote: >>> >>>> An issue to call out is how to deal with our generated code (.avro and >>>> .proto) as I don't believe those plugins allow you to generate code >>>> using a >>>> shaded package prefix on imports. >>>> >>>> On Tue, Oct 17, 2017 at 10:28 AM, Thomas Groh <[email protected] >>>> > >>>> wrote: >>>> >>>> +1 to the goal. I'm hugely in favor of not doing the same shading work >>>>> every time for dependencies we know we'll use. >>>>> >>>>> This also means that if we end up pulling in transitive dependencies we >>>>> don't want in any particular module we can avoid having to adjust our >>>>> repackaging strategy for that module - which I have run into >>>>> face-first in >>>>> the past. >>>>> >>>>> On Tue, Oct 17, 2017 at 9:48 AM, Kenneth Knowles >>>>> <[email protected]> >>>>> wrote: >>>>> >>>>> Hi all, >>>>>> >>>>>> Shading is a big part of how we keep our dependencies sane in Beam. >>>>>> But >>>>>> downsides: shading is super slow, causes massive jar bloat, and kind >>>>>> of >>>>>> hard to get right because artifacts and namespaces are not 1-to-1. >>>>>> >>>>>> I know that some communities distribute their own shaded >>>>>> distributions of >>>>>> dependencies. I had a thought about doing something similar that I >>>>>> wanted >>>>>> to throw out there for people to poke holes in. >>>>>> >>>>>> To set the scene, here is how I view shading: >>>>>> >>>>>> - A module has public dependencies and private dependencies. >>>>>> - Public deps are used for data interchange; users must share these >>>>>> >>>>> deps. >>>>> >>>>>> - Private deps are just functionality and can be hidden (in our >>>>>> case, >>>>>> relocated + bundled) >>>>>> - It isn't necessarily that simple, because public and private deps >>>>>> >>>>> might >>>>> >>>>>> interact in higher-order ways ("public" is contagious) >>>>>> >>>>>> Shading is an implementation detail of expressing these >>>>>> characteristics. >>>>>> >>>>> We >>>>> >>>>>> use shading selectively because of its downsides I mentioned above. >>>>>> >>>>>> But what about this idea: Introduce shaded deps as a single separate >>>>>> artifact. >>>>>> >>>>>> - sdks/java/private-deps: bundled uber jar with relocated versions >>>>>> of >>>>>> everything we want to shade >>>>>> >>>>>> - sdks/java/core and sdks/java/harness: no relocation or bundling - >>>>>> depends on `beam-sdks-java-private-deps` and imports like >>>>>> `org.apache.beam.sdk.private.com.google.common` directly (this is >>>>>> what >>>>>> they >>>>>> are rewritten to >>>>>> >>>>>> Some benefits >>>>>> >>>>>> - much faster builds of other modules >>>>>> - only one shaded uber jar >>>>>> - rare/no rebuilds of the uber jar >>>>>> - can use maven enforcer to forbid imports like com.google.common >>>>>> - configuration all in one place >>>>>> - no automated rewriting of our real code, which has led to some >>>>>> major >>>>>> confusion >>>>>> - easy to implement incrementally >>>>>> >>>>>> Downsides: >>>>>> >>>>>> - plenty of effort work to get there >>>>>> - unclear how many different such deps modules we need; sharing them >>>>>> >>>>> could >>>>> >>>>>> get weird >>>>>> - if we hit a roadblock, we will have committed a lot of time >>>>>> >>>>>> Just something I was musing as I spent another evening waiting for >>>>>> slow >>>>>> builds to try to confirm changes to brittle poms. >>>>>> >>>>>> Kenn >>>>>> >>>>>> >>>>> >> -- >> Jean-Baptiste Onofré >> [email protected] >> http://blog.nanthrax.net >> Talend - http://www.talend.com >> > >
