We maintain an in-house Spark build using sbt. We have no problem using sbt
assembly. We did add a few exclude statements for transitive dependencies.

The main enemies of assemblies are jars that include things they shouldn't
(kryo comes to mind, I think they include logback?), new versions of jars
that change the provider/artifact without changing the package (asm), and
incompatible new releases (protobuf). These break the transitive resolution
process. I imagine that's true for any build tool.
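
For illustration, exclude statements of that sort might look roughly like
the following in an sbt build definition; the coordinates and versions below
are made-up examples, not our actual build:

    // build.sbt (sketch): drop a logging backend that a library drags in,
    // and force a protobuf version so an incompatible release can't win resolution
    libraryDependencies ++= Seq(
      "com.esotericsoftware.kryo" % "kryo" % "2.21" exclude("ch.qos.logback", "logback-classic"),
      ("com.google.protobuf" % "protobuf-java" % "2.4.1").force()
    )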

Besides shading, I don't see anything Maven can do that sbt cannot, and if I
understand it correctly, shading is currently not done using the build tool.

Since Spark is primarily Scala/Akka based, the main developer base will be
familiar with sbt (I think?). Switching build tools is always painful. I
personally think it is smarter to put this burden on a limited number of
upstream integrators than on the community. That said, I don't think
it's a problem for us to maintain an sbt build in-house if Spark switched to
Maven.

The problem is that the complete Spark dependency graph is fairly large,
and there are a lot of conflicting versions in there, particularly when we
bump versions of dependencies; this makes managing it messy at best.

Now, I have not looked in detail at how Maven manages this - it might
just be accidental that we get a decent out-of-the-box assembled
shaded jar (since we don't do anything special to configure it).
With the current state of sbt in Spark, it definitely is not a good
solution: if we can enhance it (or it already has been?), while keeping
the management of the version/dependency graph manageable, I don't have
any objections to using either sbt or Maven!
Too many excludes, pinned versions, etc. would just make things
unmanageable in the future.


Regards,
Mridul




On Wed, Feb 26, 2014 at 8:56 AM, Evan Chan <e...@ooyala.com> wrote:
> Actually you can control exactly how sbt assembly merges or resolves
> conflicts. I believe the default settings, however, lead to an ordering
> which cannot be controlled.
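
For reference, this control is exposed through sbt-assembly's merge
strategy setting; a minimal sketch, assuming the 0.10/0.11-era keys of the
plugin (the patterns shown are illustrative):

    import sbtassembly.Plugin._
    import AssemblyKeys._

    // decide per-path how conflicting entries are merged into the fat jar
    mergeStrategy in assembly := {
      case m if m.toLowerCase.endsWith("manifest.mf") => MergeStrategy.discard
      case "reference.conf"                           => MergeStrategy.concat
      case _                                          => MergeStrategy.first  // otherwise the first jar on the classpath wins
    }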
>
> I do wish for a smarter fat jar plugin.
>
> -Evan
> To be free is not merely to cast off one's chains, but to live in a way
> that respects & enhances the freedom of others. (#NelsonMandela)
>
>> On Feb 25, 2014, at 6:50 PM, Mridul Muralidharan <mri...@gmail.com> wrote:
>>
>>> On Wed, Feb 26, 2014 at 5:31 AM, Patrick Wendell <pwend...@gmail.com> wrote:
>>> Evan - this is a good thing to bring up. Wrt the shade plug-in -
>>> right now we don't actually use it for bytecode shading - we simply
>>> use it for creating the uber jar with excludes (which sbt supports
>>> just fine via assembly).
>>
>>
>> Not really - as I mentioned initially in this thread, sbt's assembly
>> does not take dependencies into account properly and can overwrite
>> newer classes with older versions.
>> From an assembly point of view, sbt is not very good: we are yet to
>> try it after the 2.10 shift though (and probably won't, given the mess it
>> created last time).
>>
>> Regards,
>> Mridul
>>
>>
>>
>>
>>
>>>
>>> I was wondering actually, do you know if it's possible to add shaded
>>> artifacts to the *spark jar* using this plug-in (e.g. not an uber
>>> jar)? That's something I could see being really handy in the future.
>>>
>>> - Patrick
>>>
>>>> On Tue, Feb 25, 2014 at 3:39 PM, Evan Chan <e...@ooyala.com> wrote:
>>>> The problem is that the plugins are not equivalent. There is AFAIK no
>>>> equivalent to the Maven shade plugin for SBT.
>>>> There is an SBT plugin which can apparently read POM XML files
>>>> (sbt-pom-reader). However, it can't possibly handle Maven plugins, which
>>>> is still problematic.
>>>>
>>>>> On Tue, Feb 25, 2014 at 3:31 PM, yao <yaosheng...@gmail.com> wrote:
>>>>> I would prefer to keep both of them; it would be better even if that
>>>>> means pom.xml will be generated using sbt. Some companies, like my
>>>>> current one, have their own build infrastructure built on top of Maven.
>>>>> It is not easy to support sbt for these potential Spark clients. But I
>>>>> do agree to keep only one if there is a promising way to generate the
>>>>> correct configuration from the other.
>>>>>
>>>>> -Shengzhe
>>>>>
>>>>>
>>>>>> On Tue, Feb 25, 2014 at 3:20 PM, Evan Chan <e...@ooyala.com> wrote:
>>>>>>
>>>>>> The correct way to exclude dependencies in SBT is actually to declare
>>>>>> a dependency as "provided". I'm not familiar with Maven or its
>>>>>> dependencySet, but provided will mark the entire dependency tree as
>>>>>> excluded. It is also possible to exclude jar by jar, but this is
>>>>>> pretty error-prone and messy.
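
A minimal sketch of that pattern in a downstream project's build.sbt (the
Spark artifact and version are just examples):

    // resolved for compilation, but left out of the assembled fat jar
    libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.0-incubating" % "provided"

Since "provided" applies to that artifact's entire transitive tree, it
avoids listing jar-by-jar excludes.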
>>>>>>
>>>>>>> On Tue, Feb 25, 2014 at 2:45 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>>>>>> Yes, in sbt assembly you can exclude jars (although I never had a
>>>>>>> need for this) and files in jars.
>>>>>>>
>>>>>>> For example, I frequently remove log4j.properties, because for
>>>>>>> whatever reason Hadoop decided to include it, making it very
>>>>>>> difficult to use our own logging config.
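
For reference, both of these (dropping whole jars, and dropping files found
inside jars) can be expressed roughly as follows with sbt-assembly, assuming
sbt 0.13 and the same era of the plugin; the jar name is illustrative:

    // drop a whole jar from the fat jar
    excludedJars in assembly := (fullClasspath in assembly).value filter {
      _.data.getName == "servlet-api-2.5.jar"
    }

    // drop a file that other jars bundle (e.g. hadoop's log4j.properties),
    // so that our own logging config can be supplied instead
    mergeStrategy in assembly := {
      case "log4j.properties" => MergeStrategy.discard
      case _                  => MergeStrategy.first
    }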
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> On Tue, Feb 25, 2014 at 4:24 PM, Konstantin Boudnik <c...@apache.org> wrote:
>>>>>>>
>>>>>>>>> On Fri, Feb 21, 2014 at 11:11 AM, Patrick Wendell wrote:
>>>>>>>>> Kos - thanks for chiming in. Could you be more specific about what is
>>>>>>>>> available in Maven and not in sbt for these issues? I took a look at
>>>>>>>>> the Bigtop code relating to Spark. As far as I could tell [1] was the
>>>>>>>>> main point of integration with the build system (maybe there are other
>>>>>>>>> integration points)?
>>>>>>>>>
>>>>>>>>>>  - in order to integrate Spark well into the existing Hadoop stack it was
>>>>>>>>>>    necessary to have a way to avoid transitive dependency duplications and
>>>>>>>>>>    possible conflicts.
>>>>>>>>>>
>>>>>>>>>>    E.g. Maven assembly allows us to avoid adding _all_ Hadoop libs and later
>>>>>>>>>>    merely declare the Spark package dependency on standard Bigtop Hadoop
>>>>>>>>>>    packages. And yes - Bigtop packaging means the naming and layout would be
>>>>>>>>>>    standard across all commercial Hadoop distributions that are worth
>>>>>>>>>>    mentioning: ASF Bigtop convenience binary packages, and Cloudera or
>>>>>>>>>>    Hortonworks packages. Hence, the downstream user doesn't need to spend any
>>>>>>>>>>    effort to make sure that Spark "clicks-in" properly.
>>>>>>>>>
>>>>>>>>> The sbt build also allows you to plug in a Hadoop version similar to
>>>>>>>>> the Maven build.
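
At the time this was typically driven by an environment variable read in
project/SparkBuild.scala; a rough sketch (the default version shown is just
an example):

    // pick the Hadoop client version from the environment,
    // e.g.  SPARK_HADOOP_VERSION=2.2.0 sbt/sbt assembly
    val hadoopVersion = scala.util.Properties.envOrElse("SPARK_HADOOP_VERSION", "1.0.4")

    libraryDependencies += "org.apache.hadoop" % "hadoop-client" % hadoopVersion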
>>>>>>>>
>>>>>>>> I am actually talking about an ability to exclude a set of dependencies
>>>>>>>> from an assembly, similarly to what's happening in the dependencySet sections of
>>>>>>>>    assembly/src/main/assembly/assembly.xml
>>>>>>>> If there is comparable functionality in sbt, that would help quite a bit,
>>>>>>>> apparently.
>>>>>>>>
>>>>>>>> Cos
>>>>>>>>
>>>>>>>>>>  - Maven provides a relatively easy way to deal with the jar-hell problem,
>>>>>>>>>>    although the original Maven build was just shading everything into a
>>>>>>>>>>    huge lump of class files, oftentimes ending up with classes slamming on
>>>>>>>>>>    top of each other from different transitive dependencies.
>>>>>>>>>
>>>>>>>>> AFAIK we are only using the shade plug-in to deal with conflict
>>>>>>>>> resolution in the assembly jar. These are dealt with in sbt via the
>>>>>>>>> sbt assembly plug-in in an identical way. Is there a difference?
>>>>>>>>
>>>>>>>> I am bringing up the Shader because it is an awful hack, which can't
>>>>>>>> be used in a real controlled deployment.
>>>>>>>>
>>>>>>>> Cos
>>>>>>>>
>>>>>>>>> [1] https://git-wip-us.apache.org/repos/asf?p=bigtop.git;a=blob;f=bigtop-packages/src/common/spark/do-component-build;h=428540e0f6aa56cd7e78eb1c831aa7fe9496a08f;hb=master
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Evan Chan
>>>>>> Staff Engineer
>>>>>> e...@ooyala.com
>>>>
>>>>
>>>>
>>>> --
>>>> Evan Chan
>>>> Staff Engineer
>>>> e...@ooyala.com
