You can always run tests on post shaded artifacts instead of the compiled classes, it just requires us to change our maven surefire / gradle test configurations but it is true that most IDEs would behave better with a dependency jar unless you delegate all the build/test actions to the build system and then it won't matter.
On Mon, Dec 11, 2017 at 9:05 PM, Kenneth Knowles <[email protected]> wrote: > There's also, with additional overhead, > > 1a) A relocated and shipped package for each thing we want to relocate. I > think this has also been tried outside Beam... > > Pros: > * all the pros of 1) plus no bloat beyond what is necessary > Cons: > * abandons whitelist approach for public deps, reverting to blacklist > approach for trouble things like guava, so a bit less principled > > For both 1) and 1a) I would add: > > Pros: > * clearly readable dependency since code will `import > org.apache.beam.private.guava21` and IDEs will understand it is a > distinct lilbrary > * can run tests on unpackaged classes, as long as deps are shaded or > provided as jars > * no mysterious action at a distance from inherited configuration > Cons: > * need to adjust imports > > Kenn > > On Mon, Dec 11, 2017 at 9:57 AM, Lukasz Cwik <[email protected]> wrote: > >> I would suggest that either we use: >> 1) A common deps package containing shaded dependencies allows for >> Pros >> * doesn't require the user to build an uber jar >> Risks >> * dependencies package will keep growing even if something is or isn't >> needed by all of Apache Beam leading to a large jar anyways negating any >> space savings >> >> 2) Shade within each module to a common location like >> org.apache.beam.relocated.guava.... >> Pros >> * you only get the shaded dependencies of the things that are required >> * its one less dependency for users to manage >> Risks >> * requires an uber jar to be built to get the space savings (either by a >> user or a distribution of Apache Beam) otherwise we negate any space >> savings. >> >> If we either use a common relocation scheme or a dependencies jar then >> each relocation should specifically contain the version number of the >> package because we would like to allow for us to be using two different >> versions of the same library. >> >> For the common deps package approach, should we check in code where the >> imports contain the relocated location (e.g. import >> org.apache.beam.guava.20.0.com.google.common.collect.ImmutableList)? >> >> >> On Mon, Dec 11, 2017 at 8:47 AM, Jean-Baptiste Onofré <[email protected]> >> wrote: >> >>> Thanks for bringing that back. >>> >>> Indeed guava is shaded in different uber-jar. Maybe we can have a common >>> deps module that we include once (but the user will have to explicitly >>> define the dep) ? >>> >>> Basically, what do you propose for protobuf (unfortunately, I don't see >>> an obvious) ? >>> >>> Regards >>> JB >>> >>> >>> On 12/11/2017 05:35 PM, Ismaël Mejía wrote: >>> >>>> Hello, I wanted to bring back this subject because I think we should >>>> take action on this and at least first have a shaded version of guava. >>>> I was playing with a toy project and I did the procedure we use to >>>> submit jars to a Hadoop cluster via Flink/Spark which involves >>>> creating an uber jar and I realized that the size of the jar was way >>>> bigger than I expected, and the fact that we shade guava in every >>>> module contributes to this. I found guava shaded on: >>>> >>>> sdks/java/core >>>> runners/core-construction-java >>>> runners/core-java >>>> model/job-management >>>> runners/spark >>>> sdks/java/io/hadoop-file-system >>>> sdks/java/io/kafka >>>> >>>> This means at least 6 times more of the size it should which counts in >>>> around 15MB more (2.4MB*6 deps) of extra weight that we can simply >>>> reduce by using a shaded version of the library. >>>> >>>> I add this point to the previous ones mentioned during the discussion >>>> because this goes against the end user experience and affects us all >>>> (devs/users). >>>> >>>> Another question is if we should shade (and how) protocol buffers >>>> because now with the portability work we are exposing it closer to the >>>> end users. I say this because I also found an issue while running a >>>> job on YARN with the spark runner because hadoop-common includes >>>> protobuf-java 2 and I had to explicitly provide protocol-buffers 3 as >>>> a dependency to be able to use triggers (note the Spark runner >>>> translates them using some method from runners/core-java). Since >>>> hadoop-common is provided in the cluster with the older version of >>>> protobuf, I am afraid that this will bite us in the future. >>>> >>>> Ismaël >>>> >>>> ps. There is already a JIRA for that shading for protobuf on >>>> hadoop-common but this is not coming until version 3 is out. >>>> https://issues.apache.org/jira/browse/HADOOP-13136 >>>> >>>> ps2. Extra curious situation is to see that the dataflow-runner ends >>>> up having guava shaded twice via its shaded version on >>>> core-construction-java. >>>> >>>> ps3. Of course this message means a de-facto +1 at least to do it for >>>> guava and evaluate it for other libs. >>>> >>>> >>>> On Tue, Oct 17, 2017 at 7:29 PM, Lukasz Cwik <[email protected]> >>>> wrote: >>>> >>>>> An issue to call out is how to deal with our generated code (.avro and >>>>> .proto) as I don't believe those plugins allow you to generate code >>>>> using a >>>>> shaded package prefix on imports. >>>>> >>>>> On Tue, Oct 17, 2017 at 10:28 AM, Thomas Groh <[email protected] >>>>> > >>>>> wrote: >>>>> >>>>> +1 to the goal. I'm hugely in favor of not doing the same shading work >>>>>> every time for dependencies we know we'll use. >>>>>> >>>>>> This also means that if we end up pulling in transitive dependencies >>>>>> we >>>>>> don't want in any particular module we can avoid having to adjust our >>>>>> repackaging strategy for that module - which I have run into >>>>>> face-first in >>>>>> the past. >>>>>> >>>>>> On Tue, Oct 17, 2017 at 9:48 AM, Kenneth Knowles >>>>>> <[email protected]> >>>>>> wrote: >>>>>> >>>>>> Hi all, >>>>>>> >>>>>>> Shading is a big part of how we keep our dependencies sane in Beam. >>>>>>> But >>>>>>> downsides: shading is super slow, causes massive jar bloat, and kind >>>>>>> of >>>>>>> hard to get right because artifacts and namespaces are not 1-to-1. >>>>>>> >>>>>>> I know that some communities distribute their own shaded >>>>>>> distributions of >>>>>>> dependencies. I had a thought about doing something similar that I >>>>>>> wanted >>>>>>> to throw out there for people to poke holes in. >>>>>>> >>>>>>> To set the scene, here is how I view shading: >>>>>>> >>>>>>> - A module has public dependencies and private dependencies. >>>>>>> - Public deps are used for data interchange; users must share these >>>>>>> >>>>>> deps. >>>>>> >>>>>>> - Private deps are just functionality and can be hidden (in our >>>>>>> case, >>>>>>> relocated + bundled) >>>>>>> - It isn't necessarily that simple, because public and private deps >>>>>>> >>>>>> might >>>>>> >>>>>>> interact in higher-order ways ("public" is contagious) >>>>>>> >>>>>>> Shading is an implementation detail of expressing these >>>>>>> characteristics. >>>>>>> >>>>>> We >>>>>> >>>>>>> use shading selectively because of its downsides I mentioned above. >>>>>>> >>>>>>> But what about this idea: Introduce shaded deps as a single separate >>>>>>> artifact. >>>>>>> >>>>>>> - sdks/java/private-deps: bundled uber jar with relocated versions >>>>>>> of >>>>>>> everything we want to shade >>>>>>> >>>>>>> - sdks/java/core and sdks/java/harness: no relocation or bundling - >>>>>>> depends on `beam-sdks-java-private-deps` and imports like >>>>>>> `org.apache.beam.sdk.private.com.google.common` directly (this is >>>>>>> what >>>>>>> they >>>>>>> are rewritten to >>>>>>> >>>>>>> Some benefits >>>>>>> >>>>>>> - much faster builds of other modules >>>>>>> - only one shaded uber jar >>>>>>> - rare/no rebuilds of the uber jar >>>>>>> - can use maven enforcer to forbid imports like com.google.common >>>>>>> - configuration all in one place >>>>>>> - no automated rewriting of our real code, which has led to some >>>>>>> major >>>>>>> confusion >>>>>>> - easy to implement incrementally >>>>>>> >>>>>>> Downsides: >>>>>>> >>>>>>> - plenty of effort work to get there >>>>>>> - unclear how many different such deps modules we need; sharing >>>>>>> them >>>>>>> >>>>>> could >>>>>> >>>>>>> get weird >>>>>>> - if we hit a roadblock, we will have committed a lot of time >>>>>>> >>>>>>> Just something I was musing as I spent another evening waiting for >>>>>>> slow >>>>>>> builds to try to confirm changes to brittle poms. >>>>>>> >>>>>>> Kenn >>>>>>> >>>>>>> >>>>>> >>> -- >>> Jean-Baptiste Onofré >>> [email protected] >>> http://blog.nanthrax.net >>> Talend - http://www.talend.com >>> >> >> >
