Re: [DISCUSS] Necessity of Maven and SBT Build in Spark

Evan Chan Wed, 26 Feb 2014 11:35:46 -0800

Mark,

No, I haven't tried this myself yet  :-p   Also, I would expect that
sbt-pom-reader does not do assemblies at all, because assembly is an
SBT plugin, so we would still need code to include sbt-assembly.
There is also the tricky question of how to include the assembly stuff
into sbt-pom-reader generated projects.  So this needs much more
investigation.


My hunch is that it's easier to generate the pom from SBT (make-pom)
than the other way around.
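
As a rough sketch of what that direction could look like (a hypothetical
single-module project, not Spark's actual build; the project name,
organization, and dependency versions are placeholders chosen for
illustration), a build definition along these lines should be enough for
make-pom to emit a pom.xml:

```scala
// build.sbt -- hypothetical minimal project, NOT Spark's real build.
// All names and versions below are illustrative placeholders.

name := "example-core"

organization := "org.example"

version := "0.1.0-SNAPSHOT"

scalaVersion := "2.10.3"

libraryDependencies ++= Seq(
  // "provided" is carried into the generated POM as <scope>provided</scope>
  "org.apache.hadoop" % "hadoop-client" % "1.0.4" % "provided",
  // test-only dependency, emitted with <scope>test</scope>
  "org.scalatest" %% "scalatest" % "1.9.1" % "test"
)
```

Running "sbt make-pom" (the task is spelled makePom in newer sbt
releases) then writes a .pom file under target/.  Plugin configuration
such as the assembly settings is not part of that output, so it would
still have to be maintained by hand on the Maven side.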

On Wed, Feb 26, 2014 at 10:54 AM, Mark Hamstra <m...@clearstorydata.com> wrote:
> Evan,
>
> Have you actually tried to build Spark using its POM file and sbt-pom-reader?
>  I just made a first, naive attempt, and I'm still sorting through just
> what this did and didn't produce.  It looks like the basic jar files are at
> least very close to correct, and may be just fine, but that building the
> assembly jars failed completely.
>
> It's not completely obvious to me how to proceed with what sbt-pom-reader
> produces in order to build the assemblies, run the test suites, etc., so I'm
> wondering if you have already worked out what that requires?
>
>
> On Wed, Feb 26, 2014 at 9:31 AM, Evan Chan <e...@ooyala.com> wrote:
>
>> I'd like to propose the following way to move forward, based on the
>> comments I've seen:
>>
>> 1.  Aggressively clean up the giant dependency graph.   One ticket I
>> might work on if I have time is SPARK-681 which might remove the giant
>> fastutil dependency (~15MB by itself).
>>
>> 2.  Take an intermediate step by having only ONE source of truth
>> w.r.t. dependencies and versions.  This means either:
>>    a)  Using a maven POM as the spec for dependencies, Hadoop version,
>> etc.   Then, use sbt-pom-reader to import it.
>>    b)  Using the build.scala as the spec, and "sbt make-pom" to
>> generate the pom.xml for the dependencies
>>
>>     The idea is to remove the pain and errors associated with manual
>> translation of dependency specs from one system to another, while
>> still maintaining the things which are hard to translate (plugins).
>>
>>
>> On Wed, Feb 26, 2014 at 7:17 AM, Koert Kuipers <ko...@tresata.com> wrote:
>> > We maintain an in-house Spark build using sbt. We have no problem using
>> > sbt assembly. We did add a few exclude statements for transitive
>> > dependencies.
>> >
>> > The main enemies of assemblies are jars that include stuff they shouldn't
>> > (kryo comes to mind, I think they include logback?), new versions of jars
>> > that change the provider/artifact without changing the package (asm), and
>> > incompatible new releases (protobuf). These break the transitive
>> > resolution process. I imagine that's true for any build tool.
>> >
>> > Besides shading I don't see anything maven can do that sbt cannot, and if I
>> > understand it correctly shading is not currently done using the build
>> > tool.
>> >
>> > Since spark is primarily scala/akka based, the main developer base will be
>> > familiar with sbt (I think?). Switching build tools is always painful. I
>> > personally think it is smarter to put this burden on a limited number of
>> > upstream integrators than on the community. That said, I don't think
>> > it's a problem for us to maintain an sbt build in-house if spark switched
>> > to maven.
>> > The problem is, the complete spark dependency graph is fairly large,
>> > and there are a lot of conflicting versions in there.
>> > This shows up in particular when we bump versions of dependencies, which
>> > makes managing this messy at best.
>> >
>> > Now, I have not looked in detail at how maven manages this - it might
>> > just be accidental that we get a decent out-of-the-box assembled
>> > shaded jar (since we don't do anything great to configure it).
>> > With the current state of sbt in spark, it definitely is not a good
>> > solution: if we can enhance it (or it already is?), while keeping
>> > the version/dependency graph manageable, I don't have
>> > any objections to using sbt or maven!
>> > Too many excluded versions, pinned versions, etc. would just make things
>> > unmanageable in future.
>> >
>> >
>> > Regards,
>> > Mridul
>> >
>> >
>> >
>> >
>> > On Wed, Feb 26, 2014 at 8:56 AM, Evan chan <e...@ooyala.com> wrote:
>> >> Actually you can control exactly how sbt assembly merges or resolves
>> >> conflicts.  I believe the default settings, however, lead to an order
>> >> which cannot be controlled.
>> >>
>> >> I do wish for a smarter fat jar plugin.
>> >>
>> >> -Evan
>> >> To be free is not merely to cast off one's chains, but to live in a way
>> > that respects & enhances the freedom of others. (#NelsonMandela)
>> >>
>> >>> On Feb 25, 2014, at 6:50 PM, Mridul Muralidharan <mri...@gmail.com> wrote:
>> >>>
>> >>>> On Wed, Feb 26, 2014 at 5:31 AM, Patrick Wendell <pwend...@gmail.com> wrote:
>> >>>> Evan - this is a good thing to bring up. Wrt the shader plug-in -
>> >>>> right now we don't actually use it for bytecode shading - we simply
>> >>>> use it for creating the uber jar with excludes (which sbt supports
>> >>>> just fine via assembly).
>> >>>
>> >>>
>> >>> Not really - as I mentioned initially in this thread, sbt's assembly
>> >>> does not take dependencies into account properly, and can overwrite
>> >>> newer classes with older versions.
>> >>> From an assembly point of view, sbt is not very good: we are yet to
>> >>> try it after the 2.10 shift though (and probably won't, given the mess it
>> >>> created last time).
>> >>>
>> >>> Regards,
>> >>> Mridul
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>>
>> >>>> I was wondering actually, do you know if it's possible to add shaded
>> >>>> artifacts to the *spark jar* using this plug-in (i.e. not an uber
>> >>>> jar)? That's something I could see being really handy in the future.
>> >>>>
>> >>>> - Patrick
>> >>>>
>> >>>>> On Tue, Feb 25, 2014 at 3:39 PM, Evan Chan <e...@ooyala.com> wrote:
>> >>>>> The problem is that plugins are not equivalent.  There is AFAIK no
>> >>>>> equivalent to the maven shader plugin for SBT.
>> >>>>> There is an SBT plugin which can apparently read POM XML files
>> >>>>> (sbt-pom-reader).   However, it can't possibly handle plugins, which
>> >>>>> is still problematic.
>> >>>>>
>> >>>>>> On Tue, Feb 25, 2014 at 3:31 PM, yao <yaosheng...@gmail.com> wrote:
>> >>>>>> I would prefer to keep both of them; it would be better even if that
>> >>>>>> means pom.xml will be generated using sbt. Some companies, like my
>> >>>>>> current one, have their own build infrastructure built on top of maven.
>> >>>>>> It is not easy to support sbt for these potential spark clients. But I
>> >>>>>> do agree to only keep one if there is a promising way to generate a
>> >>>>>> correct configuration from the other.
>> >>>>>>
>> >>>>>> -Shengzhe
>> >>>>>>
>> >>>>>>
>> >>>>>>> On Tue, Feb 25, 2014 at 3:20 PM, Evan Chan <e...@ooyala.com> wrote:
>> >>>>>>>
>> >>>>>>> The correct way to exclude dependencies in SBT is actually to
>> >>>>>>> declare a dependency as "provided".   I'm not familiar with Maven or
>> >>>>>>> its dependencySet, but provided will mark the entire dependency tree
>> >>>>>>> as excluded.   It is also possible to exclude jar by jar, but this is
>> >>>>>>> pretty error prone and messy.
>> >>>>>>>
>> >>>>>>>> On Tue, Feb 25, 2014 at 2:45 PM, Koert Kuipers <ko...@tresata.com> wrote:
>> >>>>>>>> yes in sbt assembly you can exclude jars (although i never had a
>> >>>>>>>> need for this) and files in jars.
>> >>>>>>>>
>> >>>>>>>> for example i frequently remove log4j.properties, because for
>> >>>>>>>> whatever reason hadoop decided to include it, making it very
>> >>>>>>>> difficult to use our own logging config.
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>>> On Tue, Feb 25, 2014 at 4:24 PM, Konstantin Boudnik <c...@apache.org> wrote:
>> >>>>>>>>
>> >>>>>>>>>> On Fri, Feb 21, 2014 at 11:11AM, Patrick Wendell wrote:
>> >>>>>>>>>> Kos - thanks for chiming in. Could you be more specific about
>> >>>>>>>>>> what is available in maven and not in sbt for these issues? I
>> >>>>>>>>>> took a look at the bigtop code relating to Spark. As far as I
>> >>>>>>>>>> could tell [1] was the main point of integration with the build
>> >>>>>>>>>> system (maybe there are other integration points)?
>> >>>>>>>>>>
>> >>>>>>>>>>>  - in order to integrate Spark well into the existing Hadoop
>> >>>>>>>>>>>    stack it was necessary to have a way to avoid duplicated
>> >>>>>>>>>>>    transitive dependencies and possible conflicts.
>> >>>>>>>>>>>
>> >>>>>>>>>>>    E.g. Maven assembly allows us to avoid adding _all_ Hadoop
>> >>>>>>>>>>>    libs and later merely declare Spark package dependency on
>> >>>>>>>>>>>    standard Bigtop Hadoop packages. And yes - Bigtop packaging
>> >>>>>>>>>>>    means the naming and layout would be standard across all
>> >>>>>>>>>>>    commercial Hadoop distributions that are worth mentioning:
>> >>>>>>>>>>>    ASF Bigtop convenience binary packages, and Cloudera or
>> >>>>>>>>>>>    Hortonworks packages. Hence, the downstream user doesn't
>> >>>>>>>>>>>    need to spend any effort to make sure that Spark "clicks-in"
>> >>>>>>>>>>>    properly.
>> >>>>>>>>>>
>> >>>>>>>>>> The sbt build also allows you to plug in a Hadoop version
>> >>>>>>>>>> similar to the maven build.
>> >>>>>>>>>
>> >>>>>>>>> I am actually talking about an ability to exclude a set of
>> >>>>>>>>> dependencies from an assembly, similarly to what's happening in
>> >>>>>>>>> the dependencySet sections of
>> >>>>>>>>>    assembly/src/main/assembly/assembly.xml
>> >>>>>>>>> If there is comparable functionality in Sbt, that would help quite
>> >>>>>>>>> a bit, apparently.
>> >>>>>>>>>
>> >>>>>>>>> Cos
>> >>>>>>>>>
>> >>>>>>>>>>>  - Maven provides a relatively easy way to deal with the
>> >>>>>>>>>>>    jar-hell problem, although the original maven build was just
>> >>>>>>>>>>>    Shader'ing everything into a huge lump of class files.
>> >>>>>>>>>>>    Oftentimes ending up with classes slamming on top of each
>> >>>>>>>>>>>    other from different transitive dependencies.
>> >>>>>>>>>>
>> >>>>>>>>>> AFAIK we are only using the shade plug-in to deal with conflict
>> >>>>>>>>>> resolution in the assembly jar. These are dealt with in sbt via
>> >>>>>>>>>> the sbt assembly plug-in in an identical way. Is there a
>> >>>>>>>>>> difference?
>> >>>>>>>>>
>> >>>>>>>>> I am bringing up the Shader, because it is an awful hack, which
>> >>>>>>>>> can't be used in a real controlled deployment.
>> >>>>>>>>>
>> >>>>>>>>> Cos
>> >>>>>>>>>
>> >>>>>>>>>> [1]
>> >>>>>>>>>> https://git-wip-us.apache.org/repos/asf?p=bigtop.git;a=blob;f=bigtop-packages/src/common/spark/do-component-build;h=428540e0f6aa56cd7e78eb1c831aa7fe9496a08f;hb=master
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> --
>> >>>>>>> --
>> >>>>>>> Evan Chan
>> >>>>>>> Staff Engineer
>> >>>>>>> e...@ooyala.com  |
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> --
>> >>>>> --
>> >>>>> Evan Chan
>> >>>>> Staff Engineer
>> >>>>> e...@ooyala.com  |
>>
>>
>>
>> --
>> --
>> Evan Chan
>> Staff Engineer
>> e...@ooyala.com  |
>>



-- 
--
Evan Chan
Staff Engineer
e...@ooyala.com  |
