2018-02-02 15:37 GMT+01:00 Kenneth Knowles <k...@google.com>: > Another couple: > > - User-facing TypeDescriptor is a very thin wrapper on Guava's TypeToken >
Technically reflect Type is enough > - ImmutableList and friends and their builders are very widely used and > IMO still add a lot for readability and preventing someone coming along and > adding mistakes to a codebase > Sugar but not required. When you compare the cost of a shade versus of duplicating the parts we need there is no real match IMHO. > > So considering it all, I would keep a vendored Guava (but also move off > where we can, and also have our own improvements). I hope it will be a > near-empty build file to generate it so not a maintenance burden. > > Kenn > > On Thu, Feb 1, 2018 at 8:44 PM, Kenneth Knowles <k...@google.com> wrote: > >> Nice. This sounds like a great idea (or two?) and goes along with what I >> just started for futures. >> >> Guava: filed https://issues.apache.org/jira/browse/BEAM-3606 and >> assigned to Ismaël :-) and converted my futures thing to a subtask. >> >> Specific things for our micro guava: >> >> - checkArgumentNotNull can throw the right exception >> - our own Optional because Java's is not Serializable >> - futures combinators since many are missing, especially Java's don't do >> exceptions right >> >> Protobuf: didn't file an issue because I'm not sure >> >> I was wondering if pre-shading works. We really need to get rid of it >> from public APIs in a 100% reliable way. It is also a problem for Dataflow. >> I was wondering if one approach is to pre-shade gcpio-protobuf-java, >> gcpio-protobuf-java-util, gcpio-grpc-java, etc. Anything that needs to take >> a Message object. (and do the same for beam-model-protobuf-java since the >> model bits have to depend on each other but nothing else). >> >> Kenn >> >> On Thu, Feb 1, 2018 at 1:56 AM, Ismaël Mejía <ieme...@gmail.com> wrote: >> >>> Huge +1 to get rid of Guava! >>> >>> This solves annoying dependency issues for some IOs and allow us to >>> get rid of the shading that makes current jars bigger than they should >>> be. >>> >>> We can create our own 'micro guava' package with some classes for >>> things that are hard to migrate, or that we prefer to still have like >>> the check* methods for example. Given the size of the task we should >>> probably divide it into subtasks, more important is to get rid of it >>> for 'sdks/java/core'. We can then attack other areas afterwards. >>> >>> Other important idea would be to get rid of Protobuf in public APIs >>> like GCPIO and to better shade it from leaking into the runners. An >>> unexpected side effect of this is a leak of netty via gRPC/protobuf >>> that is byting us for the Spark runner, but well that's worth a >>> different discussion. >>> >>> >>> On Thu, Feb 1, 2018 at 10:08 AM, Romain Manni-Bucau >>> <rmannibu...@gmail.com> wrote: >>> > a map of list is fine and not a challenge we'll face long I hope ;) >>> > >>> > >>> > Romain Manni-Bucau >>> > @rmannibucau | Blog | Old Blog | Github | LinkedIn >>> > >>> > 2018-02-01 9:40 GMT+01:00 Reuven Lax <re...@google.com>: >>> >> >>> >> Not sure we'll be able to replace them all. Things like guava Table >>> and >>> >> Multimap don't have great replacements in Java8. >>> >> >>> >> On Wed, Jan 31, 2018 at 10:11 PM, Jean-Baptiste Onofré < >>> j...@nanthrax.net> >>> >> wrote: >>> >>> >>> >>> +1, it was on my TODO for a while waiting the Java8 update. >>> >>> >>> >>> Regards >>> >>> JB >>> >>> >>> >>> On 02/01/2018 06:56 AM, Romain Manni-Bucau wrote: >>> >>> > Why not dropping guava for all beam codebase? With java 8 it is >>> quite >>> >>> > easy to do >>> >>> > it and avoid a bunch of conflicts. Did it in 2 projects with quite >>> a >>> >>> > good result. >>> >>> > >>> >>> > Le 1 févr. 2018 06:50, "Lukasz Cwik" <lc...@google.com >>> >>> > <mailto:lc...@google.com>> a écrit : >>> >>> > >>> >>> > Make sure to include the guava version in the artifact name so >>> that >>> >>> > we can >>> >>> > have multiple vendored versions. >>> >>> > >>> >>> > On Wed, Jan 31, 2018 at 9:16 PM, Kenneth Knowles < >>> k...@google.com >>> >>> > <mailto:k...@google.com>> wrote: >>> >>> > >>> >>> > I didn't have time for this, but it just bit me. We >>> definitely >>> >>> > have >>> >>> > Guava on the API surface of runner support code in ways >>> that >>> >>> > get >>> >>> > incompatibly shaded. I will probably start "1a" by making a >>> >>> > shaded >>> >>> > library org.apache.beam:vendored-guava and starting to use >>> it. >>> >>> > It sounds >>> >>> > like there is generally unanimous support for that much, >>> >>> > anyhow. >>> >>> > >>> >>> > Kenn >>> >>> > >>> >>> > On Wed, Dec 13, 2017 at 8:31 AM, Aljoscha Krettek >>> >>> > <aljos...@apache.org >>> >>> > <mailto:aljos...@apache.org>> wrote: >>> >>> > >>> >>> > Thanks Ismaël for bringing up this discussion again! >>> >>> > >>> >>> > I would be in favour of 1) and more specifically of 1a) >>> >>> > >>> >>> > Aljoscha >>> >>> > >>> >>> > >>> >>> >> On 12. Dec 2017, at 18:56, Lukasz Cwik < >>> lc...@google.com >>> >>> >> <mailto:lc...@google.com>> wrote: >>> >>> >> >>> >>> >> You can always run tests on post shaded artifacts >>> instead >>> >>> >> of the >>> >>> >> compiled classes, it just requires us to change our >>> maven >>> >>> >> surefire >>> >>> >> / gradle test configurations but it is true that most >>> IDEs >>> >>> >> would >>> >>> >> behave better with a dependency jar unless you >>> delegate >>> >>> >> all the >>> >>> >> build/test actions to the build system and then it >>> won't >>> >>> >> matter. >>> >>> >> >>> >>> >> On Mon, Dec 11, 2017 at 9:05 PM, Kenneth Knowles >>> >>> >> <k...@google.com >>> >>> >> <mailto:k...@google.com>> wrote: >>> >>> >> >>> >>> >> There's also, with additional overhead, >>> >>> >> >>> >>> >> 1a) A relocated and shipped package for each >>> thing we >>> >>> >> want to >>> >>> >> relocate. I think this has also been tried outside >>> >>> >> Beam... >>> >>> >> >>> >>> >> Pros: >>> >>> >> * all the pros of 1) plus no bloat beyond what is >>> >>> >> necessary >>> >>> >> Cons: >>> >>> >> * abandons whitelist approach for public deps, >>> >>> >> reverting to >>> >>> >> blacklist approach for trouble things like guava, >>> so a >>> >>> >> bit >>> >>> >> less principled >>> >>> >> >>> >>> >> For both 1) and 1a) I would add: >>> >>> >> >>> >>> >> Pros: >>> >>> >> * clearly readable dependency since code will >>> `import >>> >>> >> org.apache.beam.private.guava21` and IDEs will >>> >>> >> understand it >>> >>> >> is a distinct lilbrary >>> >>> >> * can run tests on unpackaged classes, as long as >>> deps >>> >>> >> are >>> >>> >> shaded or provided as jars >>> >>> >> * no mysterious action at a distance from >>> inherited >>> >>> >> configuration >>> >>> >> Cons: >>> >>> >> * need to adjust imports >>> >>> >> >>> >>> >> Kenn >>> >>> >> >>> >>> >> On Mon, Dec 11, 2017 at 9:57 AM, Lukasz Cwik >>> >>> >> <lc...@google.com >>> >>> >> <mailto:lc...@google.com>> wrote: >>> >>> >> >>> >>> >> I would suggest that either we use: >>> >>> >> 1) A common deps package containing shaded >>> >>> >> dependencies >>> >>> >> allows for >>> >>> >> Pros >>> >>> >> * doesn't require the user to build an uber >>> jar >>> >>> >> Risks >>> >>> >> * dependencies package will keep growing even >>> if >>> >>> >> something >>> >>> >> is or isn't needed by all of Apache Beam >>> leading >>> >>> >> to a >>> >>> >> large jar anyways negating any space savings >>> >>> >> >>> >>> >> 2) Shade within each module to a common >>> location >>> >>> >> like >>> >>> >> org.apache.beam.relocated.guava.... >>> >>> >> Pros >>> >>> >> * you only get the shaded dependencies of the >>> >>> >> things that >>> >>> >> are required >>> >>> >> * its one less dependency for users to manage >>> >>> >> Risks >>> >>> >> * requires an uber jar to be built to get the >>> >>> >> space >>> >>> >> savings (either by a user or a distribution of >>> >>> >> Apache >>> >>> >> Beam) otherwise we negate any space savings. >>> >>> >> >>> >>> >> If we either use a common relocation scheme >>> or a >>> >>> >> dependencies jar then each relocation should >>> >>> >> specifically >>> >>> >> contain the version number of the package >>> because >>> >>> >> we would >>> >>> >> like to allow for us to be using two different >>> >>> >> versions of >>> >>> >> the same library. >>> >>> >> >>> >>> >> For the common deps package approach, should >>> we >>> >>> >> check in >>> >>> >> code where the imports contain the relocated >>> >>> >> location >>> >>> >> (e.g. import org.apache.beam.guava.20.0.com >>> >>> >> >>> >>> >> <http://org.apache.beam.guava.20.0.com/>.google.common.colle >>> ct.ImmutableList)? >>> >>> >> >>> >>> >> >>> >>> >> On Mon, Dec 11, 2017 at 8:47 AM, Jean-Baptiste >>> >>> >> Onofré >>> >>> >> <j...@nanthrax.net <mailto:j...@nanthrax.net>> >>> wrote: >>> >>> >> >>> >>> >> Thanks for bringing that back. >>> >>> >> >>> >>> >> Indeed guava is shaded in different >>> uber-jar. >>> >>> >> Maybe we >>> >>> >> can have a common deps module that we >>> include >>> >>> >> once >>> >>> >> (but the user will have to explicitly >>> define >>> >>> >> the dep) ? >>> >>> >> >>> >>> >> Basically, what do you propose for >>> protobuf >>> >>> >> (unfortunately, I don't see an obvious) ? >>> >>> >> >>> >>> >> Regards >>> >>> >> JB >>> >>> >> >>> >>> >> >>> >>> >> On 12/11/2017 05:35 PM, Ismaël Mejía >>> wrote: >>> >>> >> >>> >>> >> Hello, I wanted to bring back this >>> subject >>> >>> >> because >>> >>> >> I think we should >>> >>> >> take action on this and at least first >>> >>> >> have a >>> >>> >> shaded version of guava. >>> >>> >> I was playing with a toy project and >>> I did >>> >>> >> the >>> >>> >> procedure we use to >>> >>> >> submit jars to a Hadoop cluster via >>> >>> >> Flink/Spark >>> >>> >> which involves >>> >>> >> creating an uber jar and I realized >>> that >>> >>> >> the size >>> >>> >> of the jar was way >>> >>> >> bigger than I expected, and the fact >>> that >>> >>> >> we shade >>> >>> >> guava in every >>> >>> >> module contributes to this. I found >>> guava >>> >>> >> shaded on: >>> >>> >> >>> >>> >> sdks/java/core >>> >>> >> runners/core-construction-java >>> >>> >> runners/core-java >>> >>> >> model/job-management >>> >>> >> runners/spark >>> >>> >> sdks/java/io/hadoop-file-system >>> >>> >> sdks/java/io/kafka >>> >>> >> >>> >>> >> This means at least 6 times more of >>> the >>> >>> >> size it >>> >>> >> should which counts in >>> >>> >> around 15MB more (2.4MB*6 deps) of >>> extra >>> >>> >> weight >>> >>> >> that we can simply >>> >>> >> reduce by using a shaded version of >>> the >>> >>> >> library. >>> >>> >> >>> >>> >> I add this point to the previous ones >>> >>> >> mentioned >>> >>> >> during the discussion >>> >>> >> because this goes against the end user >>> >>> >> experience >>> >>> >> and affects us all >>> >>> >> (devs/users). >>> >>> >> >>> >>> >> Another question is if we should shade >>> >>> >> (and how) >>> >>> >> protocol buffers >>> >>> >> because now with the portability work >>> we >>> >>> >> are >>> >>> >> exposing it closer to the >>> >>> >> end users. I say this because I also >>> found >>> >>> >> an >>> >>> >> issue while running a >>> >>> >> job on YARN with the spark runner >>> because >>> >>> >> hadoop-common includes >>> >>> >> protobuf-java 2 and I had to >>> explicitly >>> >>> >> provide >>> >>> >> protocol-buffers 3 as >>> >>> >> a dependency to be able to use >>> triggers >>> >>> >> (note the >>> >>> >> Spark runner >>> >>> >> translates them using some method from >>> >>> >> runners/core-java). Since >>> >>> >> hadoop-common is provided in the >>> cluster >>> >>> >> with the >>> >>> >> older version of >>> >>> >> protobuf, I am afraid that this will >>> bite >>> >>> >> us in >>> >>> >> the future. >>> >>> >> >>> >>> >> Ismaël >>> >>> >> >>> >>> >> ps. There is already a JIRA for that >>> >>> >> shading for >>> >>> >> protobuf on >>> >>> >> hadoop-common but this is not coming >>> until >>> >>> >> version >>> >>> >> 3 is out. >>> >>> >> >>> >>> >> https://issues.apache.org/jira/browse/HADOOP-13136 >>> >>> >> >>> >>> >> <https://issues.apache.org/jira/browse/HADOOP-13136> >>> >>> >> >>> >>> >> ps2. Extra curious situation is to see >>> >>> >> that the >>> >>> >> dataflow-runner ends >>> >>> >> up having guava shaded twice via its >>> >>> >> shaded version on >>> >>> >> core-construction-java. >>> >>> >> >>> >>> >> ps3. Of course this message means a >>> >>> >> de-facto +1 at >>> >>> >> least to do it for >>> >>> >> guava and evaluate it for other libs. >>> >>> >> >>> >>> >> >>> >>> >> On Tue, Oct 17, 2017 at 7:29 PM, >>> Lukasz >>> >>> >> Cwik >>> >>> >> <lc...@google.com.invalid >>> >>> >> <mailto:lc...@google.com.invalid>> >>> wrote: >>> >>> >> >>> >>> >> An issue to call out is how to >>> deal >>> >>> >> with our >>> >>> >> generated code (.avro and >>> >>> >> .proto) as I don't believe those >>> >>> >> plugins allow >>> >>> >> you to generate code using a >>> >>> >> shaded package prefix on imports. >>> >>> >> >>> >>> >> On Tue, Oct 17, 2017 at 10:28 AM, >>> >>> >> Thomas Groh >>> >>> >> <tg...@google.com.invalid >>> >>> >> <mailto:tg...@google.com.invalid >>> >> >>> >>> >> wrote: >>> >>> >> >>> >>> >> +1 to the goal. I'm hugely in >>> >>> >> favor of not >>> >>> >> doing the same shading work >>> >>> >> every time for dependencies we >>> >>> >> know we'll use. >>> >>> >> >>> >>> >> This also means that if we >>> end up >>> >>> >> pulling >>> >>> >> in transitive dependencies we >>> >>> >> don't want in any particular >>> >>> >> module we can >>> >>> >> avoid having to adjust our >>> >>> >> repackaging strategy for that >>> >>> >> module - >>> >>> >> which I have run into >>> face-first >>> >>> >> in >>> >>> >> the past. >>> >>> >> >>> >>> >> On Tue, Oct 17, 2017 at 9:48 >>> AM, >>> >>> >> Kenneth >>> >>> >> Knowles >>> <k...@google.com.invalid >>> >>> >> <mailto:k...@google.com.invali >>> d>> >>> >>> >> wrote: >>> >>> >> >>> >>> >> Hi all, >>> >>> >> >>> >>> >> Shading is a big part of >>> how >>> >>> >> we keep >>> >>> >> our dependencies sane in >>> Beam. >>> >>> >> But >>> >>> >> downsides: shading is >>> super >>> >>> >> slow, >>> >>> >> causes massive jar bloat, >>> and >>> >>> >> kind of >>> >>> >> hard to get right because >>> >>> >> artifacts >>> >>> >> and namespaces are not >>> 1-to-1. >>> >>> >> >>> >>> >> I know that some >>> communities >>> >>> >> distribute their own >>> shaded >>> >>> >> distributions of >>> >>> >> dependencies. I had a >>> thought >>> >>> >> about >>> >>> >> doing something similar >>> that I >>> >>> >> wanted >>> >>> >> to throw out there for >>> people >>> >>> >> to poke >>> >>> >> holes in. >>> >>> >> >>> >>> >> To set the scene, here is >>> how >>> >>> >> I view >>> >>> >> shading: >>> >>> >> >>> >>> >> - A module has public >>> >>> >> dependencies >>> >>> >> and private dependencies. >>> >>> >> - Public deps are used >>> for >>> >>> >> data >>> >>> >> interchange; users must >>> share >>> >>> >> these >>> >>> >> >>> >>> >> deps. >>> >>> >> >>> >>> >> - Private deps are just >>> >>> >> functionality and can be >>> >>> >> hidden (in >>> >>> >> our case, >>> >>> >> relocated + bundled) >>> >>> >> - It isn't necessarily >>> that >>> >>> >> simple, >>> >>> >> because public and private >>> >>> >> deps >>> >>> >> >>> >>> >> might >>> >>> >> >>> >>> >> interact in higher-order >>> ways >>> >>> >> ("public" is contagious) >>> >>> >> >>> >>> >> Shading is an >>> implementation >>> >>> >> detail of >>> >>> >> expressing these >>> >>> >> characteristics. >>> >>> >> >>> >>> >> We >>> >>> >> >>> >>> >> use shading selectively >>> >>> >> because of its >>> >>> >> downsides I mentioned >>> above. >>> >>> >> >>> >>> >> But what about this idea: >>> >>> >> Introduce >>> >>> >> shaded deps as a single >>> >>> >> separate >>> >>> >> artifact. >>> >>> >> >>> >>> >> - >>> sdks/java/private-deps: >>> >>> >> bundled >>> >>> >> uber jar with relocated >>> >>> >> versions of >>> >>> >> everything we want to >>> shade >>> >>> >> >>> >>> >> - sdks/java/core and >>> >>> >> sdks/java/harness: no >>> >>> >> relocation or >>> >>> >> bundling - >>> >>> >> depends on >>> >>> >> >>> `beam-sdks-java-private-deps` >>> >>> >> and >>> >>> >> imports like >>> >>> >> >>> >>> >> `org.apache.beam.sdk.private.com.google.common` >>> >>> >> directly (this is what >>> >>> >> they >>> >>> >> are rewritten to >>> >>> >> >>> >>> >> Some benefits >>> >>> >> >>> >>> >> - much faster builds of >>> >>> >> other modules >>> >>> >> - only one shaded uber >>> jar >>> >>> >> - rare/no rebuilds of >>> the >>> >>> >> uber jar >>> >>> >> - can use maven >>> enforcer to >>> >>> >> forbid >>> >>> >> imports like >>> com.google.common >>> >>> >> - configuration all in >>> one >>> >>> >> place >>> >>> >> - no automated >>> rewriting of >>> >>> >> our real >>> >>> >> code, which has led to >>> some >>> >>> >> major >>> >>> >> confusion >>> >>> >> - easy to implement >>> >>> >> incrementally >>> >>> >> >>> >>> >> Downsides: >>> >>> >> >>> >>> >> - plenty of effort work >>> to >>> >>> >> get there >>> >>> >> - unclear how many >>> different >>> >>> >> such >>> >>> >> deps modules we need; >>> sharing >>> >>> >> them >>> >>> >> >>> >>> >> could >>> >>> >> >>> >>> >> get weird >>> >>> >> - if we hit a >>> roadblock, we >>> >>> >> will >>> >>> >> have committed a lot of >>> time >>> >>> >> >>> >>> >> Just something I was >>> musing as >>> >>> >> I spent >>> >>> >> another evening waiting >>> for >>> >>> >> slow >>> >>> >> builds to try to confirm >>> >>> >> changes to >>> >>> >> brittle poms. >>> >>> >> >>> >>> >> Kenn >>> >>> >> >>> >>> >> >>> >>> >> >>> >>> >> -- >>> >>> >> Jean-Baptiste Onofré >>> >>> >> jbono...@apache.org >>> >>> >> <mailto:jbono...@apache.org> >>> >>> >> http://blog.nanthrax.net >>> >>> >> <http://blog.nanthrax.net/> >>> >>> >> Talend - http://www.talend.com >>> >>> >> <http://www.talend.com/> >>> >>> >> >>> >>> >> >>> >>> >> >>> >>> >> >>> >>> > >>> >>> > >>> >>> > >>> >>> >>> >>> -- >>> >>> Jean-Baptiste Onofré >>> >>> jbono...@apache.org >>> >>> http://blog.nanthrax.net >>> >>> Talend - http://www.talend.com >>> >> >>> >> >>> > >>> >> >> >