Thanks Ismaël for bringing up this discussion again!

I would be in favour of 1) and more specifically of 1a)

Aljoscha

> On 12. Dec 2017, at 18:56, Lukasz Cwik <[email protected]> wrote:
> 
> You can always run tests on post shaded artifacts instead of the compiled 
> classes, it just requires us to change our maven surefire / gradle test 
> configurations but it is true that most IDEs would behave better with a 
> dependency jar unless you delegate all the build/test actions to the build 
> system and then it won't matter.
> 
> On Mon, Dec 11, 2017 at 9:05 PM, Kenneth Knowles <[email protected] 
> <mailto:[email protected]>> wrote:
> There's also, with additional overhead,
> 
> 1a) A relocated and shipped package for each thing we want to relocate. I 
> think this has also been tried outside Beam...
> 
> Pros:
> * all the pros of 1) plus no bloat beyond what is necessary
> Cons:
> * abandons whitelist approach for public deps, reverting to blacklist 
> approach for trouble things like guava, so a bit less principled
> 
> For both 1) and 1a) I would add:
> 
> Pros:
> * clearly readable dependency since code will `import 
> org.apache.beam.private.guava21` and IDEs will understand it is a distinct 
> lilbrary
> * can run tests on unpackaged classes, as long as deps are shaded or provided 
> as jars
> * no mysterious action at a distance from inherited configuration
> Cons:
> * need to adjust imports
> 
> Kenn
> 
> On Mon, Dec 11, 2017 at 9:57 AM, Lukasz Cwik <[email protected] 
> <mailto:[email protected]>> wrote:
> I would suggest that either we use:
> 1) A common deps package containing shaded dependencies allows for 
> Pros
> * doesn't require the user to build an uber jar
> Risks
> * dependencies package will keep growing even if something is or isn't needed 
> by all of Apache Beam leading to a large jar anyways negating any space 
> savings
> 
> 2) Shade within each module to a common location like 
> org.apache.beam.relocated.guava....
> Pros
> * you only get the shaded dependencies of the things that are required
> * its one less dependency for users to manage
> Risks
> * requires an uber jar to be built to get the space savings (either by a user 
> or a distribution of Apache Beam) otherwise we negate any space savings.
> 
> If we either use a common relocation scheme or a dependencies jar then each 
> relocation should specifically contain the version number of the package 
> because we would like to allow for us to be using two different versions of 
> the same library.
> 
> For the common deps package approach, should we check in code where the 
> imports contain the relocated location (e.g. import 
> org.apache.beam.guava.20.0.com 
> <http://org.apache.beam.guava.20.0.com/>.google.common.collect.ImmutableList)?
> 
> 
> On Mon, Dec 11, 2017 at 8:47 AM, Jean-Baptiste Onofré <[email protected] 
> <mailto:[email protected]>> wrote:
> Thanks for bringing that back.
> 
> Indeed guava is shaded in different uber-jar. Maybe we can have a common deps 
> module that we include once (but the user will have to explicitly define the 
> dep) ?
> 
> Basically, what do you propose for protobuf (unfortunately, I don't see an 
> obvious) ?
> 
> Regards
> JB
> 
> 
> On 12/11/2017 05:35 PM, Ismaël Mejía wrote:
> Hello, I wanted to bring back this subject because I think we should
> take action on this and at least first have a shaded version of guava.
> I was playing with a toy project and I did the procedure we use to
> submit jars to a Hadoop cluster via Flink/Spark which involves
> creating an uber jar and I realized that the size of the jar was way
> bigger than I expected, and the fact that we shade guava in every
> module contributes to this. I found guava shaded on:
> 
> sdks/java/core
> runners/core-construction-java
> runners/core-java
> model/job-management
> runners/spark
> sdks/java/io/hadoop-file-system
> sdks/java/io/kafka
> 
> This means at least 6 times more of the size it should which counts in
> around 15MB more (2.4MB*6 deps) of extra weight that we can simply
> reduce by using a shaded version of the library.
> 
> I add this point to the previous ones mentioned during the discussion
> because this goes against the end user experience and affects us all
> (devs/users).
> 
> Another question is if we should shade (and how) protocol buffers
> because now with the portability work we are exposing it closer to the
> end users. I say this because I also found an issue while running a
> job on YARN with the spark runner because hadoop-common includes
> protobuf-java 2 and I had to explicitly provide protocol-buffers 3 as
> a dependency to be able to use triggers (note the Spark runner
> translates them using some method from runners/core-java). Since
> hadoop-common is provided in the cluster with the older version of
> protobuf, I am afraid that this will bite us in the future.
> 
> Ismaël
> 
> ps. There is already a JIRA for that shading for protobuf on
> hadoop-common but this is not coming until version 3 is out.
> https://issues.apache.org/jira/browse/HADOOP-13136 
> <https://issues.apache.org/jira/browse/HADOOP-13136>
> 
> ps2. Extra curious situation is to see that the dataflow-runner ends
> up having guava shaded twice via its shaded version on
> core-construction-java.
> 
> ps3. Of course this message means a de-facto +1 at least to do it for
> guava and evaluate it for other libs.
> 
> 
> On Tue, Oct 17, 2017 at 7:29 PM, Lukasz Cwik <[email protected]> wrote:
> An issue to call out is how to deal with our generated code (.avro and
> .proto) as I don't believe those plugins allow you to generate code using a
> shaded package prefix on imports.
> 
> On Tue, Oct 17, 2017 at 10:28 AM, Thomas Groh <[email protected]>
> wrote:
> 
> +1 to the goal. I'm hugely in favor of not doing the same shading work
> every time for dependencies we know we'll use.
> 
> This also means that if we end up pulling in transitive dependencies we
> don't want in any particular module we can avoid having to adjust our
> repackaging strategy for that module - which I have run into face-first in
> the past.
> 
> On Tue, Oct 17, 2017 at 9:48 AM, Kenneth Knowles <[email protected]>
> wrote:
> 
> Hi all,
> 
> Shading is a big part of how we keep our dependencies sane in Beam. But
> downsides: shading is super slow, causes massive jar bloat, and kind of
> hard to get right because artifacts and namespaces are not 1-to-1.
> 
> I know that some communities distribute their own shaded distributions of
> dependencies. I had a thought about doing something similar that I wanted
> to throw out there for people to poke holes in.
> 
> To set the scene, here is how I view shading:
> 
>   - A module has public dependencies and private dependencies.
>   - Public deps are used for data interchange; users must share these
> deps.
>   - Private deps are just functionality and can be hidden (in our case,
> relocated + bundled)
>   - It isn't necessarily that simple, because public and private deps
> might
> interact in higher-order ways ("public" is contagious)
> 
> Shading is an implementation detail of expressing these characteristics.
> We
> use shading selectively because of its downsides I mentioned above.
> 
> But what about this idea: Introduce shaded deps as a single separate
> artifact.
> 
>   - sdks/java/private-deps: bundled uber jar with relocated versions of
> everything we want to shade
> 
>   - sdks/java/core and sdks/java/harness: no relocation or bundling -
> depends on `beam-sdks-java-private-deps` and imports like
> `org.apache.beam.sdk.private.com.google.common` directly (this is what
> they
> are rewritten to
> 
> Some benefits
> 
>   - much faster builds of other modules
>   - only one shaded uber jar
>   - rare/no rebuilds of the uber jar
>   - can use maven enforcer to forbid imports like com.google.common
>   - configuration all in one place
>   - no automated rewriting of our real code, which has led to some major
> confusion
>   - easy to implement incrementally
> 
> Downsides:
> 
>   - plenty of effort work to get there
>   - unclear how many different such deps modules we need; sharing them
> could
> get weird
>   - if we hit a roadblock, we will have committed a lot of time
> 
> Just something I was musing as I spent another evening waiting for slow
> builds to try to confirm changes to brittle poms.
> 
> Kenn
> 
> 
> 
> -- 
> Jean-Baptiste Onofré
> [email protected] <mailto:[email protected]>
> http://blog.nanthrax.net <http://blog.nanthrax.net/>
> Talend - http://www.talend.com <http://www.talend.com/>
> 
> 
> 

Reply via email to