Mark,

No, I haven't tried this myself yet :-p

Also, I would expect that sbt-pom-reader does not do assemblies at all, because assembly is done by an SBT plugin, so we would still need code to include sbt-assembly. There is also the tricky question of how to include the assembly stuff in sbt-pom-reader generated projects. So this needs much more investigation.
My hunch is that it's easier to generate the pom from SBT (make-pom) than the other way around.

On Wed, Feb 26, 2014 at 10:54 AM, Mark Hamstra <m...@clearstorydata.com> wrote:

> Evan,
>
> Have you actually tried to build Spark using its POM file and sbt-pom-reader?
> I just made a first, naive attempt, and I'm still sorting through just
> what this did and didn't produce. It looks like the basic jar files are at
> least very close to correct, and may be just fine, but that building the
> assembly jars failed completely.
>
> It's not completely obvious to me how to proceed with what sbt-pom-reader
> produces in order to build the assemblies, run the test suites, etc., so I'm
> wondering if you have already worked out what that requires?
>
> On Wed, Feb 26, 2014 at 9:31 AM, Evan Chan <e...@ooyala.com> wrote:
>
>> I'd like to propose the following way to move forward, based on the
>> comments I've seen:
>>
>> 1. Aggressively clean up the giant dependency graph. One ticket I
>>    might work on if I have time is SPARK-681 which might remove the giant
>>    fastutil dependency (~15MB by itself).
>>
>> 2. Take an intermediate step by having only ONE source of truth
>>    w.r.t. dependencies and versions. This means either:
>>    a) Using a maven POM as the spec for dependencies, Hadoop version,
>>       etc. Then, use sbt-pom-reader to import it.
>>    b) Using the build.scala as the spec, and "sbt make-pom" to
>>       generate the pom.xml for the dependencies
>>
>> The idea is to remove the pain and errors associated with manual
>> translation of dependency specs from one system to another, while
>> still maintaining the things which are hard to translate (plugins).
>>
>> On Wed, Feb 26, 2014 at 7:17 AM, Koert Kuipers <ko...@tresata.com> wrote:
>> > We maintain an in-house spark build using sbt. We have no problem using sbt
>> > assembly. We did add a few exclude statements for transitive dependencies.
>> >
>> > The main enemies of assemblies are jars that include stuff they shouldn't
>> > (kryo comes to mind, I think they include logback?), new versions of jars
>> > that change the provider/artifact without changing the package (asm), and
>> > incompatible new releases (protobuf). These break the transitive resolution
>> > process. I imagine that's true for any build tool.
>> >
>> > Besides shading I don't see anything maven can do sbt cannot, and if I
>> > understand it correctly shading is not done currently using the build tool.
>> >
>> > Since spark is primarily scala/akka based the main developer base will be
>> > familiar with sbt (I think?). Switching build tool is always painful. I
>> > personally think it is smarter to put this burden on a limited number of
>> > upstream integrators than on the community. However that said I don't think
>> > it's a problem for us to maintain an sbt build in-house if spark switched to
>> > maven.
>> >
>> > The problem is, the complete spark dependency graph is fairly large,
>> > and there are a lot of conflicting versions in there.
>> > In particular, when we bump versions of dependencies - making managing
>> > this messy at best.
>> >
>> > Now, I have not looked in detail at how maven manages this - it might
>> > just be accidental that we get a decent out-of-the-box assembled
>> > shaded jar (since we don't do anything great to configure it).
>> > With the current state of sbt in spark, it definitely is not a good
>> > solution: if we can enhance it (or it already is?), while keeping
>> > the management of the version/dependency graph manageable, I don't have
>> > any objections to using sbt or maven!
>> > Too many exclude versions, pinned versions, etc. would just make things
>> > unmanageable in future.
>> >
>> > Regards,
>> > Mridul
>> >
>> > On Wed, Feb 26, 2014 at 8:56 AM, Evan chan <e...@ooyala.com> wrote:
>> >> Actually you can control exactly how sbt assembly merges or resolves conflicts.
>> >> I believe the default settings however lead to an ordering which cannot be controlled.
>> >>
>> >> I do wish for a smarter fat jar plugin.
>> >>
>> >> -Evan
>> >> To be free is not merely to cast off one's chains, but to live in a way
>> >> that respects & enhances the freedom of others. (#NelsonMandela)
>> >>
>> >>> On Feb 25, 2014, at 6:50 PM, Mridul Muralidharan <mri...@gmail.com> wrote:
>> >>>
>> >>>> On Wed, Feb 26, 2014 at 5:31 AM, Patrick Wendell <pwend...@gmail.com> wrote:
>> >>>> Evan - this is a good thing to bring up. Wrt the shader plug-in -
>> >>>> right now we don't actually use it for bytecode shading - we simply
>> >>>> use it for creating the uber jar with excludes (which sbt supports
>> >>>> just fine via assembly).
>> >>>
>> >>> Not really - as I mentioned initially in this thread, sbt's assembly
>> >>> does not take dependencies into account properly: it can overwrite
>> >>> newer classes with older versions.
>> >>> From an assembly point of view, sbt is not very good: we are yet to
>> >>> try it after the 2.10 shift though (and probably won't, given the mess it
>> >>> created last time).
>> >>>
>> >>> Regards,
>> >>> Mridul
>> >>>
>> >>>>
>> >>>> I was wondering actually, do you know if it's possible to add shaded
>> >>>> artifacts to the *spark jar* using this plug-in (e.g. not an uber
>> >>>> jar)? That's something I could see being really handy in the future.
>> >>>>
>> >>>> - Patrick
>> >>>>
>> >>>>> On Tue, Feb 25, 2014 at 3:39 PM, Evan Chan <e...@ooyala.com> wrote:
>> >>>>> The problem is that plugins are not equivalent. There is AFAIK no
>> >>>>> equivalent to the maven shader plugin for SBT.
>> >>>>> There is an SBT plugin which can apparently read POM XML files
>> >>>>> (sbt-pom-reader). However, it can't possibly handle plugins, which
>> >>>>> is still problematic.
>> >>>>>
>> >>>>>> On Tue, Feb 25, 2014 at 3:31 PM, yao <yaosheng...@gmail.com> wrote:
>> >>>>>> I would prefer to keep both of them; it would be better even if that means
>> >>>>>> pom.xml will be generated using sbt. Some companies, like my current one,
>> >>>>>> have their own build infrastructure built on top of maven. It is not easy
>> >>>>>> to support sbt for these potential spark clients. But I do agree to only
>> >>>>>> keep one if there is a promising way to generate correct configuration from
>> >>>>>> the other.
>> >>>>>>
>> >>>>>> -Shengzhe
>> >>>>>>
>> >>>>>>> On Tue, Feb 25, 2014 at 3:20 PM, Evan Chan <e...@ooyala.com> wrote:
>> >>>>>>>
>> >>>>>>> The correct way to exclude dependencies in SBT is actually to declare
>> >>>>>>> a dependency as "provided". I'm not familiar with Maven or its
>> >>>>>>> dependencySet, but provided will mark the entire dependency tree as
>> >>>>>>> excluded. It is also possible to exclude jar by jar, but this is
>> >>>>>>> pretty error prone and messy.
>> >>>>>>>
>> >>>>>>>> On Tue, Feb 25, 2014 at 2:45 PM, Koert Kuipers <ko...@tresata.com> wrote:
>> >>>>>>>> yes in sbt assembly you can exclude jars (although i never had a need for
>> >>>>>>>> this) and files in jars.
>> >>>>>>>>
>> >>>>>>>> for example i frequently remove log4j.properties, because for whatever
>> >>>>>>>> reason hadoop decided to include it, making it very difficult to use our
>> >>>>>>>> own logging config.
>> >>>>>>>>
>> >>>>>>>>> On Tue, Feb 25, 2014 at 4:24 PM, Konstantin Boudnik <c...@apache.org> wrote:
>> >>>>>>>>>
>> >>>>>>>>>> On Fri, Feb 21, 2014 at 11:11AM, Patrick Wendell wrote:
>> >>>>>>>>>> Kos - thanks for chiming in. Could you be more specific about what is
>> >>>>>>>>>> available in maven and not in sbt for these issues? I took a look at
>> >>>>>>>>>> the bigtop code relating to Spark.
>> >>>>>>>>>> As far as I could tell [1] was the
>> >>>>>>>>>> main point of integration with the build system (maybe there are other
>> >>>>>>>>>> integration points)?
>> >>>>>>>>>>
>> >>>>>>>>>>> - in order to integrate Spark well into existing Hadoop stack it was
>> >>>>>>>>>>>   necessary to have a way to avoid transitive dependencies duplications and
>> >>>>>>>>>>>   possible conflicts.
>> >>>>>>>>>>>
>> >>>>>>>>>>>   E.g. Maven assembly allows us to avoid adding _all_ Hadoop libs and later
>> >>>>>>>>>>>   merely declare Spark package dependency on standard Bigtop Hadoop
>> >>>>>>>>>>>   packages. And yes - Bigtop packaging means the naming and layout would be
>> >>>>>>>>>>>   standard across all commercial Hadoop distributions that are worth
>> >>>>>>>>>>>   mentioning: ASF Bigtop convenience binary packages, and Cloudera or
>> >>>>>>>>>>>   Hortonworks packages. Hence, the downstream user doesn't need to spend any
>> >>>>>>>>>>>   effort to make sure that Spark "clicks-in" properly.
>> >>>>>>>>>>
>> >>>>>>>>>> The sbt build also allows you to plug in a Hadoop version similar to
>> >>>>>>>>>> the maven build.
>> >>>>>>>>>
>> >>>>>>>>> I am actually talking about an ability to exclude a set of dependencies
>> >>>>>>>>> from an assembly, similarly to what's happening in dependencySet sections of
>> >>>>>>>>> assembly/src/main/assembly/assembly.xml
>> >>>>>>>>> If there is a comparable functionality in Sbt, that would help quite a bit,
>> >>>>>>>>> apparently.
>> >>>>>>>>>
>> >>>>>>>>> Cos
>> >>>>>>>>>
>> >>>>>>>>>>> - Maven provides a relatively easy way to deal with the jar-hell problem,
>> >>>>>>>>>>>   although the original maven build was just Shader'ing everything into a
>> >>>>>>>>>>>   huge lump of class files.
>> >>>>>>>>>>>   Oftentimes ending up with classes slamming on
>> >>>>>>>>>>>   top of each other from different transitive dependencies.
>> >>>>>>>>>>
>> >>>>>>>>>> AFAIK we are only using the shade plug-in to deal with conflict
>> >>>>>>>>>> resolution in the assembly jar. These are dealt with in sbt via the
>> >>>>>>>>>> sbt assembly plug-in in an identical way. Is there a difference?
>> >>>>>>>>>
>> >>>>>>>>> I am bringing up the Shader, because it is an awful hack, which can't be
>> >>>>>>>>> used in a real controlled deployment.
>> >>>>>>>>>
>> >>>>>>>>> Cos
>> >>>>>>>>>
>> >>>>>>>>>> [1]
>> >>>>>>>>>> https://git-wip-us.apache.org/repos/asf?p=bigtop.git;a=blob;f=bigtop-packages/src/common/spark/do-component-build;h=428540e0f6aa56cd7e78eb1c831aa7fe9496a08f;hb=master
>> >>>>>>>
>> >>>>>>> --
>> >>>>>>> Evan Chan
>> >>>>>>> Staff Engineer
>> >>>>>>> e...@ooyala.com

--
Evan Chan
Staff Engineer
e...@ooyala.com
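
For readers following the thread, here is a rough sketch of the three sbt-side techniques mentioned above: a "provided" dependency, a jar-by-jar exclusion, and an explicit sbt-assembly merge rule that drops log4j.properties. It is not taken from the Spark build. It assumes a recent sbt 1.x with the sbt-assembly plugin already added in project/plugins.sbt, and the com.example / org.example coordinates are hypothetical placeholders.

```scala
// build.sbt: illustrative sketch only, not the Spark build.
// Assumes sbt 1.x and that sbt-assembly is enabled in project/plugins.sbt.

libraryDependencies ++= Seq(
  // "provided": available at compile time, but expected to be supplied by
  // the runtime environment, so it is kept out of the assembly jar.
  "org.apache.hadoop" % "hadoop-client" % "2.2.0" % "provided",

  // Jar-by-jar exclusion of one transitive dependency.
  // Both sets of coordinates here are hypothetical.
  ("com.example" % "some-library" % "1.0.0")
    .exclude("org.example", "conflicting-dep")
)

// Decide explicitly what happens when several jars contain the same path,
// rather than relying only on the plugin defaults.
assembly / assemblyMergeStrategy := {
  // Drop any bundled log4j.properties so our own logging config wins.
  case "log4j.properties" => MergeStrategy.discard
  // Everything else falls back to the plugin's default strategy.
  case other =>
    val defaultStrategy = (assembly / assemblyMergeStrategy).value
    defaultStrategy(other)
}
```

With something like this in place, running "sbt assembly" should build the fat jar using these rules. Whether that is enough to avoid the class-overwriting problems described earlier in the thread is exactly the open question.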