On Feb 26, 2014 11:12 PM, "Patrick Wendell" <pwend...@gmail.com> wrote:
>
> @mridul - As far as I know both Maven and Sbt use fairly similar
> processes for building the assembly/uber jar. We actually used to
> package spark with sbt and there were no specific issues we
> encountered and AFAIK sbt respects versioning of transitive
> dependencies correctly. Do you have a specific bug listing for sbt
> that indicates something is broken?
Slightly longish ...

The assembled jar, generated via sbt, broke all over the place while I
was adding yarn support in 0.6 - and I had to fix the sbt project a fair
bit to get it to work: we need the assembled jar to submit a yarn job.
When I finally submitted those changes to 0.7, it broke even more - since
dependencies had changed: someone else had thankfully already added maven
support by then - which worked remarkably well out of the box (with some
minor tweaks)!

In theory, they might be expected to work the same, but practically they
did not: as I mentioned, it must just have been luck that maven worked
that well; but given multiple past nasty experiences with sbt, and the
fact that it does not bring anything compelling or new in contrast, I am
fairly against the idea of using only sbt - in spite of maven being
unintuitive at times.

Regards,
Mridul

>
> @sandy - It sounds like you are saying that the CDH build would be
> easier with Maven because you can inherit the POM. However, is this
> just a matter of convenience for packagers or would standardizing on
> sbt limit capabilities in some way? I assume that it would just mean a
> bit more manual work for packagers having to figure out how to set the
> hadoop version in SBT and exclude certain dependencies. For instance,
> what does CDH do about other components like Impala that are not based
> on Maven at all?
>
> On Wed, Feb 26, 2014 at 9:31 AM, Evan Chan <e...@ooyala.com> wrote:
> > I'd like to propose the following way to move forward, based on the
> > comments I've seen:
> >
> > 1. Aggressively clean up the giant dependency graph. One ticket I
> > might work on if I have time is SPARK-681, which might remove the
> > giant fastutil dependency (~15MB by itself).
> >
> > 2. Take an intermediate step by having only ONE source of truth
> > w.r.t. dependencies and versions. This means either:
> >    a) Using a maven POM as the spec for dependencies, Hadoop version,
> >       etc. Then, use sbt-pom-reader to import it.
> >    b) Using the build.scala as the spec, and "sbt make-pom" to
> >       generate the pom.xml for the dependencies
> >
> > The idea is to remove the pain and errors associated with manual
> > translation of dependency specs from one system to another, while
> > still maintaining the things which are hard to translate (plugins).
> >
> > On Wed, Feb 26, 2014 at 7:17 AM, Koert Kuipers <ko...@tresata.com> wrote:
> >> We maintain an in-house spark build using sbt. We have no problem
> >> using sbt assembly. We did add a few exclude statements for
> >> transitive dependencies.
> >>
> >> The main enemies of assemblies are jars that include stuff they
> >> shouldn't (kryo comes to mind, I think they include logback?), new
> >> versions of jars that change the provider/artifact without changing
> >> the package (asm), and incompatible new releases (protobuf). These
> >> break the transitive resolution process. I imagine that's true for
> >> any build tool.
> >>
> >> Besides shading I don't see anything maven can do that sbt cannot,
> >> and if I understand it correctly shading is not currently done using
> >> the build tool.
> >>
> >> Since spark is primarily scala/akka based, the main developer base
> >> will be familiar with sbt (I think?). Switching build tools is always
> >> painful. I personally think it is smarter to put this burden on a
> >> limited number of upstream integrators than on the community. However,
> >> that said, I don't think it's a problem for us to maintain an sbt
> >> build in-house if spark switched to maven.
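For readers who have not written such excludes, the statements Koert mentions look roughly like the sketch below in an sbt build definition. The coordinates are purely illustrative, not the exclusions of any particular in-house build:

    // Drop one transitive dependency from a single library
    // (hypothetical coordinates, shown only for the syntax):
    libraryDependencies +=
      "org.apache.hadoop" % "hadoop-client" % "2.2.0" exclude("asm", "asm")

    // Or drop an organization everywhere it is pulled in transitively:
    libraryDependencies +=
      "com.example" % "some-library" % "1.0" excludeAll(
        ExclusionRule(organization = "commons-logging")
      )

Such rules sit next to the dependency declarations themselves, which is part of why accumulating many of them becomes hard to manage, as Mridul notes below.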
> >> The problem is, the complete spark dependency graph is fairly large,
> >> and there are a lot of conflicting versions in there - in particular
> >> when we bump versions of dependencies - which makes managing this
> >> messy at best.
> >>
> >> Now, I have not looked in detail at how maven manages this - it might
> >> just be accidental that we get a decent out-of-the-box assembled
> >> shaded jar (since we don't do anything great to configure it).
> >> With the current state of sbt in spark, it definitely is not a good
> >> solution: if we can enhance it (or it already is?), while keeping
> >> the management of the version/dependency graph manageable, I don't
> >> have any objections to using sbt or maven!
> >> Too many exclude versions, pinned versions, etc. would just make
> >> things unmanageable in future.
> >>
> >> Regards,
> >> Mridul
> >>
> >>
> >> On Wed, Feb 26, 2014 at 8:56 AM, Evan Chan <e...@ooyala.com> wrote:
> >>> Actually you can control exactly how sbt assembly merges or resolves
> >>> conflicts. I believe the default settings, however, lead to an
> >>> ordering which cannot be controlled.
> >>>
> >>> I do wish for a smarter fat jar plugin.
> >>>
> >>> -Evan
> >>> To be free is not merely to cast off one's chains, but to live in a
> >>> way that respects & enhances the freedom of others. (#NelsonMandela)
> >>>
> >>>> On Feb 25, 2014, at 6:50 PM, Mridul Muralidharan <mri...@gmail.com> wrote:
> >>>>
> >>>>> On Wed, Feb 26, 2014 at 5:31 AM, Patrick Wendell <pwend...@gmail.com> wrote:
> >>>>> Evan - this is a good thing to bring up. Wrt the shader plug-in -
> >>>>> right now we don't actually use it for bytecode shading - we simply
> >>>>> use it for creating the uber jar with excludes (which sbt supports
> >>>>> just fine via assembly).
> >>>>
> >>>> Not really - as I mentioned initially in this thread, sbt's assembly
> >>>> does not take dependencies into account properly: it can overwrite
> >>>> newer classes with older versions.
> >>>> From an assembly point of view, sbt is not very good: we are yet to
> >>>> try it after the 2.10 shift though (and probably won't, given the
> >>>> mess it created last time).
> >>>>
> >>>> Regards,
> >>>> Mridul
> >>>>
> >>>>
> >>>>>
> >>>>> I was wondering actually, do you know if it's possible to add shaded
> >>>>> artifacts to the *spark jar* using this plug-in (e.g. not an uber
> >>>>> jar)? That's something I could see being really handy in the future.
> >>>>>
> >>>>> - Patrick
> >>>>>
> >>>>>> On Tue, Feb 25, 2014 at 3:39 PM, Evan Chan <e...@ooyala.com> wrote:
> >>>>>> The problem is that plugins are not equivalent. There is AFAIK no
> >>>>>> equivalent to the maven shader plugin for SBT.
> >>>>>> There is an SBT plugin which can apparently read POM XML files
> >>>>>> (sbt-pom-reader). However, it can't possibly handle plugins, which
> >>>>>> is still problematic.
> >>>>>>
> >>>>>>> On Tue, Feb 25, 2014 at 3:31 PM, yao <yaosheng...@gmail.com> wrote:
> >>>>>>> I would prefer to keep both of them; it would be better even if
> >>>>>>> that means pom.xml will be generated using sbt. Some companies,
> >>>>>>> like my current one, have their own build infrastructure built on
> >>>>>>> top of maven. It is not easy to support sbt for these potential
> >>>>>>> spark clients. But I do agree to only keep one if there is a
> >>>>>>> promising way to generate correct configuration from the other.
> >>>>>>>
> >>>>>>> -Shengzhe
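For what it's worth, the direction yao and Evan's option 2(b) describe already has a rough building block in sbt itself: the built-in makePom task writes a pom.xml for the project's dependencies, although, as Evan points out, plugin configuration is not captured. A minimal sketch; the URL is only an illustrative placeholder:

    // At the sbt prompt (hyphenated command form in sbt 0.12, task form in 0.13):
    //   > make-pom
    //   > makePom
    // The generated POM typically lands under target/scala-<version>/.

    // Extra POM sections can be appended from the build definition if needed:
    pomExtra :=
      <url>http://example.org/project-homepage</url>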
> >>>>>>>
> >>>>>>>
> >>>>>>>> On Tue, Feb 25, 2014 at 3:20 PM, Evan Chan <e...@ooyala.com> wrote:
> >>>>>>>>
> >>>>>>>> The correct way to exclude dependencies in SBT is actually to
> >>>>>>>> declare a dependency as "provided". I'm not familiar with Maven or
> >>>>>>>> its dependencySet, but provided will mark the entire dependency
> >>>>>>>> tree as excluded. It is also possible to exclude jar by jar, but
> >>>>>>>> this is pretty error-prone and messy.
> >>>>>>>>
> >>>>>>>>> On Tue, Feb 25, 2014 at 2:45 PM, Koert Kuipers <ko...@tresata.com> wrote:
> >>>>>>>>> yes in sbt assembly you can exclude jars (although i never had a
> >>>>>>>>> need for this) and files in jars.
> >>>>>>>>>
> >>>>>>>>> for example i frequently remove log4j.properties, because for
> >>>>>>>>> whatever reason hadoop decided to include it, making it very
> >>>>>>>>> difficult to use our own logging config.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>> On Tue, Feb 25, 2014 at 4:24 PM, Konstantin Boudnik <c...@apache.org> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> On Fri, Feb 21, 2014 at 11:11 AM, Patrick Wendell wrote:
> >>>>>>>>>>> Kos - thanks for chiming in. Could you be more specific about
> >>>>>>>>>>> what is available in maven and not in sbt for these issues? I
> >>>>>>>>>>> took a look at the bigtop code relating to Spark. As far as I
> >>>>>>>>>>> could tell, [1] was the main point of integration with the build
> >>>>>>>>>>> system (maybe there are other integration points)?
> >>>>>>>>>>>
> >>>>>>>>>>>> - in order to integrate Spark well into the existing Hadoop
> >>>>>>>>>>>>   stack it was necessary to have a way to avoid duplication of
> >>>>>>>>>>>>   transitive dependencies and possible conflicts.
> >>>>>>>>>>>>
> >>>>>>>>>>>>   E.g. Maven assembly allows us to avoid adding _all_ Hadoop
> >>>>>>>>>>>>   libs and later merely declare the Spark package's dependency
> >>>>>>>>>>>>   on standard Bigtop Hadoop packages. And yes - Bigtop
> >>>>>>>>>>>>   packaging means the naming and layout would be standard
> >>>>>>>>>>>>   across all commercial Hadoop distributions that are worth
> >>>>>>>>>>>>   mentioning: ASF Bigtop convenience binary packages, and
> >>>>>>>>>>>>   Cloudera or Hortonworks packages. Hence, the downstream user
> >>>>>>>>>>>>   doesn't need to spend any effort to make sure that Spark
> >>>>>>>>>>>>   "clicks-in" properly.
> >>>>>>>>>>>
> >>>>>>>>>>> The sbt build also allows you to plug in a Hadoop version,
> >>>>>>>>>>> similar to the maven build.
> >>>>>>>>>>
> >>>>>>>>>> I am actually talking about an ability to exclude a set of
> >>>>>>>>>> dependencies from an assembly, similarly to what's happening in
> >>>>>>>>>> the dependencySet sections of
> >>>>>>>>>> assembly/src/main/assembly/assembly.xml
> >>>>>>>>>> If there is comparable functionality in Sbt, that would help
> >>>>>>>>>> quite a bit, apparently.
> >>>>>>>>>>
> >>>>>>>>>> Cos
> >>>>>>>>>>
> >>>>>>>>>>>> - Maven provides a relatively easy way to deal with the
> >>>>>>>>>>>>   jar-hell problem, although the original maven build was just
> >>>>>>>>>>>>   Shade'ing everything into a huge lump of class files,
> >>>>>>>>>>>>   oftentimes ending up with classes slamming on top of each
> >>>>>>>>>>>>   other from different transitive dependencies.
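To make Evan's and Koert's answers to Cos's question concrete, the sbt side of this looks roughly as follows. The coordinates and jar name are illustrative, and the second setting assumes the sbt-assembly plugin of that era (with its AssemblyKeys imported), not Spark's actual build files:

    // Keep a dependency on the compile classpath but out of the assembled jar -
    // the rough analogue of leaving it out of a Maven assembly dependencySet:
    libraryDependencies +=
      "org.apache.hadoop" % "hadoop-client" % "2.2.0" % "provided"

    // sbt-assembly can also drop individual jars from the fat jar:
    excludedJars in assembly <<= (fullClasspath in assembly) map { cp =>
      cp filter { _.data.getName == "unwanted-library-1.0.jar" }  // hypothetical jar name
    }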
> >>>>>>>>>>>
> >>>>>>>>>>> AFAIK we are only using the shade plug-in to deal with conflict
> >>>>>>>>>>> resolution in the assembly jar. These are dealt with in sbt via
> >>>>>>>>>>> the sbt assembly plug-in in an identical way. Is there a
> >>>>>>>>>>> difference?
> >>>>>>>>>>
> >>>>>>>>>> I am bringing up the Shader because it is an awful hack, which
> >>>>>>>>>> can't be used in a real, controlled deployment.
> >>>>>>>>>>
> >>>>>>>>>> Cos
> >>>>>>>>>>
> >>>>>>>>>>> [1]
> >>>>>>>>>>> https://git-wip-us.apache.org/repos/asf?p=bigtop.git;a=blob;f=bigtop-packages/src/common/spark/do-component-build;h=428540e0f6aa56cd7e78eb1c831aa7fe9496a08f;hb=master
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> Evan Chan
> >>>>>>>> Staff Engineer
> >>>>>>>> e...@ooyala.com |
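Finally, a sketch of the per-file merge control Evan refers to, and that Koert uses to drop log4j.properties. This assumes the sbt-assembly keys are in scope (import sbtassembly.Plugin._ and AssemblyKeys._) and is only an illustration, not a copy of Spark's actual settings:

    mergeStrategy in assembly := {
      case "log4j.properties"                         => MergeStrategy.discard // drop the copy Hadoop ships
      case "reference.conf"                           => MergeStrategy.concat  // Akka reference configs must be concatenated
      case m if m.toLowerCase.endsWith("manifest.mf") => MergeStrategy.discard
      case _                                          => MergeStrategy.first   // otherwise the first file found wins
    }

Whether the defaults behave sensibly without such a configuration is exactly the point of disagreement earlier in the thread.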