You can always run tests on post shaded artifacts instead of the compiled
classes, it just requires us to change our maven surefire / gradle test
configurations but it is true that most IDEs would behave better with a
dependency jar unless you delegate all the build/test actions to the build
system and then it won't matter.

On Mon, Dec 11, 2017 at 9:05 PM, Kenneth Knowles <[email protected]> wrote:

> There's also, with additional overhead,
>
> 1a) A relocated and shipped package for each thing we want to relocate. I
> think this has also been tried outside Beam...
>
> Pros:
> * all the pros of 1) plus no bloat beyond what is necessary
> Cons:
> * abandons whitelist approach for public deps, reverting to blacklist
> approach for trouble things like guava, so a bit less principled
>
> For both 1) and 1a) I would add:
>
> Pros:
> * clearly readable dependency since code will `import
> org.apache.beam.private.guava21` and IDEs will understand it is a
> distinct lilbrary
> * can run tests on unpackaged classes, as long as deps are shaded or
> provided as jars
> * no mysterious action at a distance from inherited configuration
> Cons:
> * need to adjust imports
>
> Kenn
>
> On Mon, Dec 11, 2017 at 9:57 AM, Lukasz Cwik <[email protected]> wrote:
>
>> I would suggest that either we use:
>> 1) A common deps package containing shaded dependencies allows for
>> Pros
>> * doesn't require the user to build an uber jar
>> Risks
>> * dependencies package will keep growing even if something is or isn't
>> needed by all of Apache Beam leading to a large jar anyways negating any
>> space savings
>>
>> 2) Shade within each module to a common location like
>> org.apache.beam.relocated.guava....
>> Pros
>> * you only get the shaded dependencies of the things that are required
>> * its one less dependency for users to manage
>> Risks
>> * requires an uber jar to be built to get the space savings (either by a
>> user or a distribution of Apache Beam) otherwise we negate any space
>> savings.
>>
>> If we either use a common relocation scheme or a dependencies jar then
>> each relocation should specifically contain the version number of the
>> package because we would like to allow for us to be using two different
>> versions of the same library.
>>
>> For the common deps package approach, should we check in code where the
>> imports contain the relocated location (e.g. import
>> org.apache.beam.guava.20.0.com.google.common.collect.ImmutableList)?
>>
>>
>> On Mon, Dec 11, 2017 at 8:47 AM, Jean-Baptiste Onofré <[email protected]>
>> wrote:
>>
>>> Thanks for bringing that back.
>>>
>>> Indeed guava is shaded in different uber-jar. Maybe we can have a common
>>> deps module that we include once (but the user will have to explicitly
>>> define the dep) ?
>>>
>>> Basically, what do you propose for protobuf (unfortunately, I don't see
>>> an obvious) ?
>>>
>>> Regards
>>> JB
>>>
>>>
>>> On 12/11/2017 05:35 PM, Ismaël Mejía wrote:
>>>
>>>> Hello, I wanted to bring back this subject because I think we should
>>>> take action on this and at least first have a shaded version of guava.
>>>> I was playing with a toy project and I did the procedure we use to
>>>> submit jars to a Hadoop cluster via Flink/Spark which involves
>>>> creating an uber jar and I realized that the size of the jar was way
>>>> bigger than I expected, and the fact that we shade guava in every
>>>> module contributes to this. I found guava shaded on:
>>>>
>>>> sdks/java/core
>>>> runners/core-construction-java
>>>> runners/core-java
>>>> model/job-management
>>>> runners/spark
>>>> sdks/java/io/hadoop-file-system
>>>> sdks/java/io/kafka
>>>>
>>>> This means at least 6 times more of the size it should which counts in
>>>> around 15MB more (2.4MB*6 deps) of extra weight that we can simply
>>>> reduce by using a shaded version of the library.
>>>>
>>>> I add this point to the previous ones mentioned during the discussion
>>>> because this goes against the end user experience and affects us all
>>>> (devs/users).
>>>>
>>>> Another question is if we should shade (and how) protocol buffers
>>>> because now with the portability work we are exposing it closer to the
>>>> end users. I say this because I also found an issue while running a
>>>> job on YARN with the spark runner because hadoop-common includes
>>>> protobuf-java 2 and I had to explicitly provide protocol-buffers 3 as
>>>> a dependency to be able to use triggers (note the Spark runner
>>>> translates them using some method from runners/core-java). Since
>>>> hadoop-common is provided in the cluster with the older version of
>>>> protobuf, I am afraid that this will bite us in the future.
>>>>
>>>> Ismaël
>>>>
>>>> ps. There is already a JIRA for that shading for protobuf on
>>>> hadoop-common but this is not coming until version 3 is out.
>>>> https://issues.apache.org/jira/browse/HADOOP-13136
>>>>
>>>> ps2. Extra curious situation is to see that the dataflow-runner ends
>>>> up having guava shaded twice via its shaded version on
>>>> core-construction-java.
>>>>
>>>> ps3. Of course this message means a de-facto +1 at least to do it for
>>>> guava and evaluate it for other libs.
>>>>
>>>>
>>>> On Tue, Oct 17, 2017 at 7:29 PM, Lukasz Cwik <[email protected]>
>>>> wrote:
>>>>
>>>>> An issue to call out is how to deal with our generated code (.avro and
>>>>> .proto) as I don't believe those plugins allow you to generate code
>>>>> using a
>>>>> shaded package prefix on imports.
>>>>>
>>>>> On Tue, Oct 17, 2017 at 10:28 AM, Thomas Groh <[email protected]
>>>>> >
>>>>> wrote:
>>>>>
>>>>> +1 to the goal. I'm hugely in favor of not doing the same shading work
>>>>>> every time for dependencies we know we'll use.
>>>>>>
>>>>>> This also means that if we end up pulling in transitive dependencies
>>>>>> we
>>>>>> don't want in any particular module we can avoid having to adjust our
>>>>>> repackaging strategy for that module - which I have run into
>>>>>> face-first in
>>>>>> the past.
>>>>>>
>>>>>> On Tue, Oct 17, 2017 at 9:48 AM, Kenneth Knowles
>>>>>> <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>> Hi all,
>>>>>>>
>>>>>>> Shading is a big part of how we keep our dependencies sane in Beam.
>>>>>>> But
>>>>>>> downsides: shading is super slow, causes massive jar bloat, and kind
>>>>>>> of
>>>>>>> hard to get right because artifacts and namespaces are not 1-to-1.
>>>>>>>
>>>>>>> I know that some communities distribute their own shaded
>>>>>>> distributions of
>>>>>>> dependencies. I had a thought about doing something similar that I
>>>>>>> wanted
>>>>>>> to throw out there for people to poke holes in.
>>>>>>>
>>>>>>> To set the scene, here is how I view shading:
>>>>>>>
>>>>>>>   - A module has public dependencies and private dependencies.
>>>>>>>   - Public deps are used for data interchange; users must share these
>>>>>>>
>>>>>> deps.
>>>>>>
>>>>>>>   - Private deps are just functionality and can be hidden (in our
>>>>>>> case,
>>>>>>> relocated + bundled)
>>>>>>>   - It isn't necessarily that simple, because public and private deps
>>>>>>>
>>>>>> might
>>>>>>
>>>>>>> interact in higher-order ways ("public" is contagious)
>>>>>>>
>>>>>>> Shading is an implementation detail of expressing these
>>>>>>> characteristics.
>>>>>>>
>>>>>> We
>>>>>>
>>>>>>> use shading selectively because of its downsides I mentioned above.
>>>>>>>
>>>>>>> But what about this idea: Introduce shaded deps as a single separate
>>>>>>> artifact.
>>>>>>>
>>>>>>>   - sdks/java/private-deps: bundled uber jar with relocated versions
>>>>>>> of
>>>>>>> everything we want to shade
>>>>>>>
>>>>>>>   - sdks/java/core and sdks/java/harness: no relocation or bundling -
>>>>>>> depends on `beam-sdks-java-private-deps` and imports like
>>>>>>> `org.apache.beam.sdk.private.com.google.common` directly (this is
>>>>>>> what
>>>>>>> they
>>>>>>> are rewritten to
>>>>>>>
>>>>>>> Some benefits
>>>>>>>
>>>>>>>   - much faster builds of other modules
>>>>>>>   - only one shaded uber jar
>>>>>>>   - rare/no rebuilds of the uber jar
>>>>>>>   - can use maven enforcer to forbid imports like com.google.common
>>>>>>>   - configuration all in one place
>>>>>>>   - no automated rewriting of our real code, which has led to some
>>>>>>> major
>>>>>>> confusion
>>>>>>>   - easy to implement incrementally
>>>>>>>
>>>>>>> Downsides:
>>>>>>>
>>>>>>>   - plenty of effort work to get there
>>>>>>>   - unclear how many different such deps modules we need; sharing
>>>>>>> them
>>>>>>>
>>>>>> could
>>>>>>
>>>>>>> get weird
>>>>>>>   - if we hit a roadblock, we will have committed a lot of time
>>>>>>>
>>>>>>> Just something I was musing as I spent another evening waiting for
>>>>>>> slow
>>>>>>> builds to try to confirm changes to brittle poms.
>>>>>>>
>>>>>>> Kenn
>>>>>>>
>>>>>>>
>>>>>>
>>> --
>>> Jean-Baptiste Onofré
>>> [email protected]
>>> http://blog.nanthrax.net
>>> Talend - http://www.talend.com
>>>
>>
>>
>

Reply via email to