A couple of comments: 1) Whether the Spark POM is produced by SBT or Maven
shouldn't matter for those who just need to link against published
artifacts, but right now SBT and Maven do not produce equivalent POMs for
Spark -- I think. 2) Incremental builds with Maven are only trivially more
difficult than they are with SBT -- just start a Zinc daemon and forget
about it.
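
Concretely, this is roughly what I mean -- a sketch only; the exact
scala-maven-plugin settings needed to pick up the zinc server vary, so
check its docs:

    # start the long-running Zinc compile server once
    zinc -start
    # then ordinary Maven builds compile incrementally against it
    # (assuming the pom's scala-maven-plugin is configured to use zinc)
    mvn -DskipTests package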


On Fri, Feb 28, 2014 at 12:35 PM, Patrick Wendell <pwend...@gmail.com> wrote:

> Hey,
>
> Thanks everyone for chiming in on this. I wanted to summarize these
> issues a bit particularly wrt the constituents involved - does this
> seem accurate?
>
> = Spark Users =
> In general those linking against Spark should be totally unaffected by
> the build choice. Spark will continue to publish well-formed poms and
> jars to maven central. This is a no-op wrt this decision.
>
> = Spark Developers =
> There are two concerns. (a) General day-to-day development and
> packaging and (b) Spark binaries and packages for distribution.
>
> For (a) - sbt seems better because it's just nicer for doing scala
> development (incremental compilation is simple, we have some
> home-baked tools for compiling Spark vs. the spark deps, etc.). The
> arguments that maven has more "general know-how", at least so far,
> haven't affected us in the ~2 years we've maintained both builds -
> adding stuff for Maven is typically just as annoying/difficult as it
> is with sbt.
>
> For (b) - Some non-specific concerns were raised about bugs with the
> sbt assembly package - we should look into this and see what is going
> on. Maven has better out-of-the-box support for publishing to Maven
> central; we'd have to do some manual work on our end to make this work
> well with sbt.
>
> = Downstream Integrators =
> On this one it seems that Maven is the universal favorite, largely
> because of community awareness of Maven and comfort with Maven builds.
> Some things, like restructuring the Spark build to inherit config
> values from a vendor build, will not be possible with sbt (though
> fairly straightforward to work around). Other cases where vendors have
> directly modified or inherited the Spark build won't work anymore if
> we standardize on SBT. These have no obvious workaround at this point,
> as far as I can see.
>
> - Patrick
>
> On Wed, Feb 26, 2014 at 7:09 PM, Mridul Muralidharan <mri...@gmail.com>
> wrote:
> > On Feb 26, 2014 11:12 PM, "Patrick Wendell" <pwend...@gmail.com> wrote:
> >>
> >> @mridul - As far as I know both Maven and Sbt use fairly similar
> >> processes for building the assembly/uber jar. We actually used to
> >> package spark with sbt and there were no specific issues we
> >> encountered and AFAIK sbt respects versioning of transitive
> >> dependencies correctly. Do you have a specific bug listing for sbt
> >> that indicates something is broken?
> >
> > Slightly longish ...
> >
> > The assembled jar, generated via sbt, broke all over the place while I
> > was adding yarn support in 0.6 - and I had to fix the sbt project a fair
> > bit to get it to work: we need the assembled jar to submit a yarn job.
> >
> > When I finally submitted those changes to 0.7, it broke even more - since
> > dependencies changed: someone else had thankfully already added maven
> > support by then - which worked remarkably well out of the box (with some
> > minor tweaks)!
> >
> > In theory, they might be expected to work the same, but practically they
> > did not: as I mentioned, it must just have been luck that maven worked
> > that well; but given multiple past nasty experiences with sbt, and the
> > fact that it does not bring anything compelling or new in contrast, I am
> > fairly against the idea of using only sbt - in spite of maven being
> > unintuitive at times.
> >
> > Regards,
> > Mridul
> >
> >>
> >> @sandy - It sounds like you are saying that the CDH build would be
> >> easier with Maven because you can inherit the POM. However, is this
> >> just a matter of convenience for packagers or would standardizing on
> >> sbt limit capabilities in some way? I assume that it would just mean a
> >> bit more manual work for packagers having to figure out how to set the
> >> hadoop version in SBT and exclude certain dependencies. For instance,
> >> what does CDH do about other components, like Impala, that are not
> >> based on Maven at all?
> >>
> >> On Wed, Feb 26, 2014 at 9:31 AM, Evan Chan <e...@ooyala.com> wrote:
> >> > I'd like to propose the following way to move forward, based on the
> >> > comments I've seen:
> >> >
> >> > 1.  Aggressively clean up the giant dependency graph.   One ticket I
> >> > might work on if I have time is SPARK-681 which might remove the giant
> >> > fastutil dependency (~15MB by itself).
> >> >
> >> > 2.  Take an intermediate step by having only ONE source of truth
> >> > w.r.t. dependencies and versions.  This means either:
> >> >    a)  Using a maven POM as the spec for dependencies, Hadoop version,
> >> > etc.   Then, use sbt-pom-reader to import it.
> >> >    b)  Using the build.scala as the spec, and "sbt make-pom" to
> >> > generate the pom.xml for the dependencies
> >> >
> >> >     The idea is to remove the pain and errors associated with manual
> >> > translation of dependency specs from one system to another, while
> >> > still maintaining the things which are hard to translate (plugins).
> >> >
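> >> > For concreteness, a rough (untested) sketch of the two options -- the
> >> > sbt-pom-reader coordinates below are from memory, so double-check them:
> >> >
> >> >     // (b) from a build.scala-driven build, have sbt emit the pom:
> >> >     //       sbt make-pom
> >> >     //     which writes the pom next to the artifact under target/
> >> >
> >> >     // (a) or go the other way: in project/plugins.sbt, pull in
> >> >     //     sbt-pom-reader so sbt reads deps/versions from pom.xml
> >> >     addSbtPlugin("com.typesafe.sbt" % "sbt-pom-reader" % "1.0.0")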
> >> >
> >> > On Wed, Feb 26, 2014 at 7:17 AM, Koert Kuipers <ko...@tresata.com> wrote:
> >> >> We maintain an in-house spark build using sbt. We have no problem using
> >> >> sbt assembly. We did add a few exclude statements for transitive
> >> >> dependencies.
> >> >>
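> >> >> (Roughly like this, for illustration -- the artifacts shown are just
> >> >> placeholders, not an actual exclude list:)
> >> >>
> >> >>     // build.sbt: drop conflicting transitive deps from a dependency
> >> >>     libraryDependencies +=
> >> >>       "org.apache.hadoop" % "hadoop-client" % "2.2.0" excludeAll(
> >> >>         ExclusionRule(organization = "org.codehaus.jackson"),
> >> >>         ExclusionRule(organization = "asm")
> >> >>       )
> >> >>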
> >> >> The main enemies of assemblies are jars that include stuff they
> >> >> shouldn't (kryo comes to mind, I think they include logback?), new
> >> >> versions of jars that change the provider/artifact without changing the
> >> >> package (asm), and incompatible new releases (protobuf). These break the
> >> >> transitive resolution process. I imagine that's true for any build tool.
> >> >>
> >> >> Besides shading I don't see anything maven can do that sbt cannot, and
> >> >> if I understand it correctly, shading is not currently done using the
> >> >> build tool.
> >> >>
> >> >> Since spark is primarily scala/akka based, the main developer base will
> >> >> be familiar with sbt (I think?). Switching build tools is always
> >> >> painful. I personally think it is smarter to put this burden on a
> >> >> limited number of upstream integrators than on the community. That
> >> >> said, I don't think it's a problem for us to maintain an sbt build
> >> >> in-house if spark switched to maven.
> >> >> The problem is, the complete spark dependency graph is fairly large,
> >> >> and there are a lot of conflicting versions in there - in particular
> >> >> when we bump versions of dependencies - making managing this messy at
> >> >> best.
> >> >>
> >> >> Now, I have not looked in detail at how maven manages this - it might
> >> >> just be accidental that we get a decent out-of-the-box assembled
> >> >> shaded jar (since we don't do anything great to configure it).
> >> >> With the current state of sbt in spark, it definitely is not a good
> >> >> solution: if we can enhance it (or it already is?), while keeping the
> >> >> management of the version/dependency graph manageable, I don't have
> >> >> any objections to using sbt or maven! Too many excludes, pinned
> >> >> versions, etc. would just make things unmanageable in the future.
> >> >>
> >> >>
> >> >> Regards,
> >> >> Mridul
> >> >>
> >> >>
> >> >>
> >> >>
> >> >> On Wed, Feb 26, 2014 at 8:56 AM, Evan Chan <e...@ooyala.com> wrote:
> >> >>> Actually you can control exactly how sbt assembly merges or resolves
> >> >>> conflicts. I believe the default settings, however, lead to an order
> >> >>> which cannot be controlled.
> >> >>>
> >> >>> I do wish for a smarter fat jar plugin.
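> >> >>>
> >> >>> (For reference, the knob looks roughly like this -- a sketch using
> >> >>> sbt-assembly 0.10.x-era syntax from memory, so details may be off:)
> >> >>>
> >> >>>     // decide per-path how conflicting entries in the fat jar are merged
> >> >>>     mergeStrategy in assembly := {
> >> >>>       case PathList("META-INF", xs @ _*) => MergeStrategy.discard
> >> >>>       case "reference.conf"              => MergeStrategy.concat
> >> >>>       case _                             => MergeStrategy.first
> >> >>>     }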
> >> >>>
> >> >>> -Evan
> >> >>> To be free is not merely to cast off one's chains, but to live in a
> >> >>> way that respects & enhances the freedom of others. (#NelsonMandela)
> >> >>>
> >> >>>> On Feb 25, 2014, at 6:50 PM, Mridul Muralidharan <mri...@gmail.com> wrote:
> >> >>>>
> >> >>>>> On Wed, Feb 26, 2014 at 5:31 AM, Patrick Wendell <pwend...@gmail.com> wrote:
> >> >>>>> Evan - this is a good thing to bring up. Wrt the shader plug-in -
> >> >>>>> right now we don't actually use it for bytecode shading - we simply
> >> >>>>> use it for creating the uber jar with excludes (which sbt supports
> >> >>>>> just fine via assembly).
> >> >>>>
> >> >>>>
> >> >>>> Not really - as I mentioned initially in this thread, sbt's assembly
> >> >>>> does not take dependencies into account properly, and can overwrite
> >> >>>> newer classes with older versions.
> >> >>>> From an assembly point of view, sbt is not very good: we are yet to
> >> >>>> try it after the 2.10 shift though (and probably won't, given the
> >> >>>> mess it created last time).
> >> >>>>
> >> >>>> Regards,
> >> >>>> Mridul
> >> >>>>
> >> >>>>
> >> >>>>
> >> >>>>
> >> >>>>
> >> >>>>>
> >> >>>>> I was wondering actually, do you know if it's possible to add shaded
> >> >>>>> artifacts to the *spark jar* using this plug-in (e.g. not an uber
> >> >>>>> jar)? That's something I could see being really handy in the future.
> >> >>>>>
> >> >>>>> - Patrick
> >> >>>>>
> >> >>>>>> On Tue, Feb 25, 2014 at 3:39 PM, Evan Chan <e...@ooyala.com> wrote:
> >> >>>>>> The problem is that plugins are not equivalent. There is AFAIK no
> >> >>>>>> equivalent to the maven shader plugin for SBT.
> >> >>>>>> There is an SBT plugin which can apparently read POM XML files
> >> >>>>>> (sbt-pom-reader). However, it can't possibly handle plugins, which
> >> >>>>>> is still problematic.
> >> >>>>>>
> >> >>>>>>> On Tue, Feb 25, 2014 at 3:31 PM, yao <yaosheng...@gmail.com> wrote:
> >> >>>>>>> I would prefer to keep both of them; it would be better even if that
> >> >>>>>>> means pom.xml will be generated using sbt. Some companies, like my
> >> >>>>>>> current one, have their own build infrastructure built on top of
> >> >>>>>>> maven. It is not easy to support sbt for these potential spark
> >> >>>>>>> clients. But I do agree to keep only one if there is a promising way
> >> >>>>>>> to generate correct configuration from the other.
> >> >>>>>>>
> >> >>>>>>> -Shengzhe
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>>> On Tue, Feb 25, 2014 at 3:20 PM, Evan Chan <e...@ooyala.com> wrote:
> >> >>>>>>>>
> >> >>>>>>>> The correct way to exclude dependencies in SBT is actually to
> >> >>>>>>>> declare a dependency as "provided". I'm not familiar with Maven or
> >> >>>>>>>> its dependencySet, but provided will mark the entire dependency
> >> >>>>>>>> tree as excluded. It is also possible to exclude jar by jar, but
> >> >>>>>>>> this is pretty error prone and messy.
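> >> >>>>>>>>
> >> >>>>>>>> (Something along these lines -- an illustrative sketch; the hadoop
> >> >>>>>>>> artifact and version are just placeholders:)
> >> >>>>>>>>
> >> >>>>>>>>     // build.sbt: mark hadoop as provided so it, and its whole
> >> >>>>>>>>     // transitive tree, stays out of the assembly jar
> >> >>>>>>>>     libraryDependencies +=
> >> >>>>>>>>       "org.apache.hadoop" % "hadoop-client" % "2.2.0" % "provided"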
> >> >>>>>>>>
> >> >>>>>>>>> On Tue, Feb 25, 2014 at 2:45 PM, Koert Kuipers <ko...@tresata.com> wrote:
> >> >>>>>>>>> Yes, in sbt assembly you can exclude jars (although I never had a
> >> >>>>>>>>> need for this) and files within jars.
> >> >>>>>>>>>
> >> >>>>>>>>> For example, I frequently remove log4j.properties, because for
> >> >>>>>>>>> whatever reason hadoop decided to include it, making it very
> >> >>>>>>>>> difficult to use our own logging config.
> >> >>>>>>>>>
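> >> >>>>>>>>> (For illustration, roughly like this -- sbt-assembly key names from
> >> >>>>>>>>> memory, so treat it as a sketch:)
> >> >>>>>>>>>
> >> >>>>>>>>>     // drop hadoop's bundled log4j.properties from the fat jar
> >> >>>>>>>>>     mergeStrategy in assembly := {
> >> >>>>>>>>>       case "log4j.properties" => MergeStrategy.discard
> >> >>>>>>>>>       case _                  => MergeStrategy.first
> >> >>>>>>>>>     }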
> >> >>>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>>> On Tue, Feb 25, 2014 at 4:24 PM, Konstantin Boudnik <c...@apache.org> wrote:
> >> >>>>>>>>>
> >> >>>>>>>>>>> On Fri, Feb 21, 2014 at 11:11AM, Patrick Wendell wrote:
> >> >>>>>>>>>>> Kos - thanks for chiming in. Could you be more specific about
> >> >>>>>>>>>>> what is available in maven and not in sbt for these issues? I
> >> >>>>>>>>>>> took a look at the bigtop code relating to Spark. As far as I
> >> >>>>>>>>>>> could tell, [1] was the main point of integration with the build
> >> >>>>>>>>>>> system (maybe there are other integration points)?
> >> >>>>>>>>>>>
> >> >>>>>>>>>>>>  - in order to integrate Spark well into the existing Hadoop
> >> >>>>>>>>>>>>    stack it was necessary to have a way to avoid transitive
> >> >>>>>>>>>>>>    dependency duplications and possible conflicts.
> >> >>>>>>>>>>>>
> >> >>>>>>>>>>>>    E.g. Maven assembly allows us to avoid adding _all_ Hadoop
> >> >>>>>>>>>>>>    libs and later merely declare the Spark package's dependency
> >> >>>>>>>>>>>>    on standard Bigtop Hadoop packages. And yes - Bigtop
> >> >>>>>>>>>>>>    packaging means the naming and layout would be standard
> >> >>>>>>>>>>>>    across all commercial Hadoop distributions that are worth
> >> >>>>>>>>>>>>    mentioning: ASF Bigtop convenience binary packages, and
> >> >>>>>>>>>>>>    Cloudera or Hortonworks packages. Hence, the downstream user
> >> >>>>>>>>>>>>    doesn't need to spend any effort to make sure that Spark
> >> >>>>>>>>>>>>    "clicks in" properly.
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> The sbt build also allows you to plug in a Hadoop version,
> >> >>>>>>>>>>> similar to the maven build.
> >> >>>>>>>>>>
> >> >>>>>>>>>> I am actually talking about the ability to exclude a set of
> >> >>>>>>>>>> dependencies from an assembly, similar to what's happening in the
> >> >>>>>>>>>> dependencySet sections of
> >> >>>>>>>>>>    assembly/src/main/assembly/assembly.xml
> >> >>>>>>>>>> If there is comparable functionality in Sbt, that would help
> >> >>>>>>>>>> quite a bit, apparently.
> >> >>>>>>>>>>
> >> >>>>>>>>>> Cos
> >> >>>>>>>>>>
> >> >>>>>>>>>>>>  - Maven provides a relatively easy way to deal with the
> >> >>>>>>>>>>>>    jar-hell problem, although the original maven build was just
> >> >>>>>>>>>>>>    Shader'ing everything into a huge lump of class files,
> >> >>>>>>>>>>>>    oftentimes ending up with classes slamming on top of each
> >> >>>>>>>>>>>>    other from different transitive dependencies.
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> AFAIK we are only using the shade plug-in to deal with conflict
> >> >>>>>>>>>>> resolution in the assembly jar. These are dealt with in sbt via
> >> >>>>>>>>>>> the sbt assembly plug-in in an identical way. Is there a
> >> >>>>>>>>>>> difference?
> >> >>>>>>>>>>
> >> >>>>>>>>>> I am bringing up the Shader because it is an awful hack, which
> >> >>>>>>>>>> can't be used in a real controlled deployment.
> >> >>>>>>>>>>
> >> >>>>>>>>>> Cos
> >> >>>>>>>>>>
> >> >>>>>>>>>>> [1] https://git-wip-us.apache.org/repos/asf?p=bigtop.git;a=blob;f=bigtop-packages/src/common/spark/do-component-build;h=428540e0f6aa56cd7e78eb1c831aa7fe9496a08f;hb=master
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>> --
> >> >>>>>>>> --
> >> >>>>>>>> Evan Chan
> >> >>>>>>>> Staff Engineer
> >> >>>>>>>> e...@ooyala.com  |
> >> >>>>>>
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> --
> >> >>>>>> --
> >> >>>>>> Evan Chan
> >> >>>>>> Staff Engineer
> >> >>>>>> e...@ooyala.com  |
> >> >
> >> >
> >> >
> >> > --
> >> > --
> >> > Evan Chan
> >> > Staff Engineer
> >> > e...@ooyala.com  |
>
