There's also, with additional overhead,

1a) A relocated and shipped package for each thing we want to relocate. I
think this has also been tried outside Beam...

Pros:
* all the pros of 1) plus no bloat beyond what is necessary
Cons:
* abandons whitelist approach for public deps, reverting to blacklist
approach for trouble things like guava, so a bit less principled

For both 1) and 1a) I would add:

Pros:
* clearly readable dependency since code will `import
org.apache.beam.private.guava21` and IDEs will understand it is a distinct
lilbrary
* can run tests on unpackaged classes, as long as deps are shaded or
provided as jars
* no mysterious action at a distance from inherited configuration
Cons:
* need to adjust imports

Kenn

On Mon, Dec 11, 2017 at 9:57 AM, Lukasz Cwik <[email protected]> wrote:

> I would suggest that either we use:
> 1) A common deps package containing shaded dependencies allows for
> Pros
> * doesn't require the user to build an uber jar
> Risks
> * dependencies package will keep growing even if something is or isn't
> needed by all of Apache Beam leading to a large jar anyways negating any
> space savings
>
> 2) Shade within each module to a common location like
> org.apache.beam.relocated.guava....
> Pros
> * you only get the shaded dependencies of the things that are required
> * its one less dependency for users to manage
> Risks
> * requires an uber jar to be built to get the space savings (either by a
> user or a distribution of Apache Beam) otherwise we negate any space
> savings.
>
> If we either use a common relocation scheme or a dependencies jar then
> each relocation should specifically contain the version number of the
> package because we would like to allow for us to be using two different
> versions of the same library.
>
> For the common deps package approach, should we check in code where the
> imports contain the relocated location (e.g. import
> org.apache.beam.guava.20.0.com.google.common.collect.ImmutableList)?
>
>
> On Mon, Dec 11, 2017 at 8:47 AM, Jean-Baptiste Onofré <[email protected]>
> wrote:
>
>> Thanks for bringing that back.
>>
>> Indeed guava is shaded in different uber-jar. Maybe we can have a common
>> deps module that we include once (but the user will have to explicitly
>> define the dep) ?
>>
>> Basically, what do you propose for protobuf (unfortunately, I don't see
>> an obvious) ?
>>
>> Regards
>> JB
>>
>>
>> On 12/11/2017 05:35 PM, Ismaël Mejía wrote:
>>
>>> Hello, I wanted to bring back this subject because I think we should
>>> take action on this and at least first have a shaded version of guava.
>>> I was playing with a toy project and I did the procedure we use to
>>> submit jars to a Hadoop cluster via Flink/Spark which involves
>>> creating an uber jar and I realized that the size of the jar was way
>>> bigger than I expected, and the fact that we shade guava in every
>>> module contributes to this. I found guava shaded on:
>>>
>>> sdks/java/core
>>> runners/core-construction-java
>>> runners/core-java
>>> model/job-management
>>> runners/spark
>>> sdks/java/io/hadoop-file-system
>>> sdks/java/io/kafka
>>>
>>> This means at least 6 times more of the size it should which counts in
>>> around 15MB more (2.4MB*6 deps) of extra weight that we can simply
>>> reduce by using a shaded version of the library.
>>>
>>> I add this point to the previous ones mentioned during the discussion
>>> because this goes against the end user experience and affects us all
>>> (devs/users).
>>>
>>> Another question is if we should shade (and how) protocol buffers
>>> because now with the portability work we are exposing it closer to the
>>> end users. I say this because I also found an issue while running a
>>> job on YARN with the spark runner because hadoop-common includes
>>> protobuf-java 2 and I had to explicitly provide protocol-buffers 3 as
>>> a dependency to be able to use triggers (note the Spark runner
>>> translates them using some method from runners/core-java). Since
>>> hadoop-common is provided in the cluster with the older version of
>>> protobuf, I am afraid that this will bite us in the future.
>>>
>>> Ismaël
>>>
>>> ps. There is already a JIRA for that shading for protobuf on
>>> hadoop-common but this is not coming until version 3 is out.
>>> https://issues.apache.org/jira/browse/HADOOP-13136
>>>
>>> ps2. Extra curious situation is to see that the dataflow-runner ends
>>> up having guava shaded twice via its shaded version on
>>> core-construction-java.
>>>
>>> ps3. Of course this message means a de-facto +1 at least to do it for
>>> guava and evaluate it for other libs.
>>>
>>>
>>> On Tue, Oct 17, 2017 at 7:29 PM, Lukasz Cwik <[email protected]>
>>> wrote:
>>>
>>>> An issue to call out is how to deal with our generated code (.avro and
>>>> .proto) as I don't believe those plugins allow you to generate code
>>>> using a
>>>> shaded package prefix on imports.
>>>>
>>>> On Tue, Oct 17, 2017 at 10:28 AM, Thomas Groh <[email protected]
>>>> >
>>>> wrote:
>>>>
>>>> +1 to the goal. I'm hugely in favor of not doing the same shading work
>>>>> every time for dependencies we know we'll use.
>>>>>
>>>>> This also means that if we end up pulling in transitive dependencies we
>>>>> don't want in any particular module we can avoid having to adjust our
>>>>> repackaging strategy for that module - which I have run into
>>>>> face-first in
>>>>> the past.
>>>>>
>>>>> On Tue, Oct 17, 2017 at 9:48 AM, Kenneth Knowles
>>>>> <[email protected]>
>>>>> wrote:
>>>>>
>>>>> Hi all,
>>>>>>
>>>>>> Shading is a big part of how we keep our dependencies sane in Beam.
>>>>>> But
>>>>>> downsides: shading is super slow, causes massive jar bloat, and kind
>>>>>> of
>>>>>> hard to get right because artifacts and namespaces are not 1-to-1.
>>>>>>
>>>>>> I know that some communities distribute their own shaded
>>>>>> distributions of
>>>>>> dependencies. I had a thought about doing something similar that I
>>>>>> wanted
>>>>>> to throw out there for people to poke holes in.
>>>>>>
>>>>>> To set the scene, here is how I view shading:
>>>>>>
>>>>>>   - A module has public dependencies and private dependencies.
>>>>>>   - Public deps are used for data interchange; users must share these
>>>>>>
>>>>> deps.
>>>>>
>>>>>>   - Private deps are just functionality and can be hidden (in our
>>>>>> case,
>>>>>> relocated + bundled)
>>>>>>   - It isn't necessarily that simple, because public and private deps
>>>>>>
>>>>> might
>>>>>
>>>>>> interact in higher-order ways ("public" is contagious)
>>>>>>
>>>>>> Shading is an implementation detail of expressing these
>>>>>> characteristics.
>>>>>>
>>>>> We
>>>>>
>>>>>> use shading selectively because of its downsides I mentioned above.
>>>>>>
>>>>>> But what about this idea: Introduce shaded deps as a single separate
>>>>>> artifact.
>>>>>>
>>>>>>   - sdks/java/private-deps: bundled uber jar with relocated versions
>>>>>> of
>>>>>> everything we want to shade
>>>>>>
>>>>>>   - sdks/java/core and sdks/java/harness: no relocation or bundling -
>>>>>> depends on `beam-sdks-java-private-deps` and imports like
>>>>>> `org.apache.beam.sdk.private.com.google.common` directly (this is
>>>>>> what
>>>>>> they
>>>>>> are rewritten to
>>>>>>
>>>>>> Some benefits
>>>>>>
>>>>>>   - much faster builds of other modules
>>>>>>   - only one shaded uber jar
>>>>>>   - rare/no rebuilds of the uber jar
>>>>>>   - can use maven enforcer to forbid imports like com.google.common
>>>>>>   - configuration all in one place
>>>>>>   - no automated rewriting of our real code, which has led to some
>>>>>> major
>>>>>> confusion
>>>>>>   - easy to implement incrementally
>>>>>>
>>>>>> Downsides:
>>>>>>
>>>>>>   - plenty of effort work to get there
>>>>>>   - unclear how many different such deps modules we need; sharing them
>>>>>>
>>>>> could
>>>>>
>>>>>> get weird
>>>>>>   - if we hit a roadblock, we will have committed a lot of time
>>>>>>
>>>>>> Just something I was musing as I spent another evening waiting for
>>>>>> slow
>>>>>> builds to try to confirm changes to brittle poms.
>>>>>>
>>>>>> Kenn
>>>>>>
>>>>>>
>>>>>
>> --
>> Jean-Baptiste Onofré
>> [email protected]
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
>
>

Reply via email to