I will echo Steve L's comment about having zinc running (with --nailed). That provides at least a 2x speedup; sometimes without it, Spark simply does not build for me.
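A minimal sketch of what that might look like, assuming the standalone `zinc` launcher is installed and on the PATH (exact flag names and the default port vary by zinc version, so treat this as illustrative rather than definitive):

```shell
# Sketch: keep a warm Scala compiler daemon running before building.
# --nailed / -nailed runs zinc as a Nailgun-backed server; flag spelling
# differs between zinc releases, so check `zinc -help` for your version.
zinc -start -nailed

# Spark's Maven build (via scala-maven-plugin) can then hand Scala
# compilation to the running zinc server instead of forking a fresh
# compiler JVM for every module:
./make-distribution.sh -T 1C -Phadoop-2.6

# Stop the daemon when you're done.
zinc -shutdown
```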
2015-12-08 9:33 GMT-08:00 Josh Rosen <joshro...@databricks.com>:

> @Nick, on a fresh EC2 instance a significant chunk of the initial build
> time might be due to artifact resolution + downloading. Putting
> pre-populated Ivy and Maven caches onto your EC2 machine could shave a
> decent chunk of time off that first build.
>
> On Tue, Dec 8, 2015 at 9:16 AM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>
>> Thanks for the tips, Jakob and Steve.
>>
>> It looks like my original approach is the best for me, since I'm
>> installing Spark on newly launched EC2 instances and can't take advantage
>> of incremental compilation.
>>
>> Nick
>>
>> On Tue, Dec 8, 2015 at 7:01 AM Steve Loughran <ste...@hortonworks.com> wrote:
>>
>>> On 7 Dec 2015, at 19:07, Jakob Odersky <joder...@gmail.com> wrote:
>>>
>>> make-distribution and the second code snippet both create a distribution
>>> from a clean state. They therefore require that every source file be
>>> compiled, and that takes time (you can maybe tweak some settings or use a
>>> newer compiler to gain some speed).
>>>
>>> I'm inferring from your question that for your use case deployment speed
>>> is a critical issue, and furthermore that you'd like to build Spark for lots of
>>> (every?) commit in a systematic way. In that case I would suggest you try
>>> using the second code snippet without the `clean` task, and only resort to
>>> it if the build fails.
>>>
>>> On my local machine, an assembly without a clean drops from 6 minutes to 2.
>>>
>>> regards,
>>> --Jakob
>>>
>>> 1. You can use zinc (where possible) to speed up Scala compilations.
>>> 2. You might also consider setting up a local Jenkins VM, hooked to
>>> whatever git repo & branch you are working off, and have it do the builds
>>> and tests for you. Not so great for interactive dev, though.
>>>
>>> Finally, on the Mac, the "say" command is pretty handy at letting you
>>> know when some work in a terminal is done, so you can do the
>>> first-thing-in-the-morning build of the SNAPSHOTs:
>>>
>>>     mvn install -DskipTests -Pyarn,hadoop-2.6 -Dhadoop.version=2.7.1; say moo
>>>
>>> After that you can work on the modules you care about (via the `-pl`
>>> option). That doesn't work if you are running on an EC2 instance, though.
>>>
>>> On 23 November 2015 at 20:18, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>>>
>>>> Say I want to build a complete Spark distribution against Hadoop 2.6+
>>>> as fast as possible from scratch.
>>>>
>>>> This is what I'm doing at the moment:
>>>>
>>>>     ./make-distribution.sh -T 1C -Phadoop-2.6
>>>>
>>>> -T 1C instructs Maven to spin up 1 thread per available core. This
>>>> takes around 20 minutes on an m3.large instance.
>>>>
>>>> I see that spark-ec2, on the other hand, builds Spark as follows
>>>> <https://github.com/amplab/spark-ec2/blob/a990752575cd8b0ab25731d7820a55c714798ec3/spark/init.sh#L21-L22>
>>>> when you deploy Spark at a specific git commit:
>>>>
>>>>     sbt/sbt clean assembly
>>>>     sbt/sbt publish-local
>>>>
>>>> This seems slower than using make-distribution.sh, actually.
>>>>
>>>> Is there a faster way to do this?
>>>>
>>>> Nick
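For Nick's fresh-EC2-instance case, Josh's cache-seeding suggestion might be sketched as follows. The bucket name, tarball name, and the idea of staging through S3 are all hypothetical; the point is just to land warm `~/.m2` and `~/.ivy2` directories on the box before the first build:

```shell
# Hypothetical sketch: seed dependency caches on a freshly launched EC2
# instance so the first build skips most artifact resolution/downloading.
# Assumes a tarball of warm caches was uploaded to S3 from a prior build:
#   tar -czf spark-build-caches.tar.gz -C "$HOME" .m2 .ivy2
aws s3 cp s3://my-example-bucket/spark-build-caches.tar.gz /tmp/
tar -xzf /tmp/spark-build-caches.tar.gz -C "$HOME"   # unpacks .m2/ and .ivy2/

# The first build now resolves most artifacts from the local caches:
./make-distribution.sh -T 1C -Phadoop-2.6
```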
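For local, long-lived checkouts, Jakob's and Steve's advice (pay the full build cost once, then skip `clean` and rebuild only the modules you touch) might combine like this. The `-pl core` module selector is an illustrative example, not a prescribed target:

```shell
# Morning: one full SNAPSHOT install (Steve's command, with the macOS
# "say" bell so you know when the terminal is free):
mvn install -DskipTests -Pyarn,hadoop-2.6 -Dhadoop.version=2.7.1; say moo

# Afterwards, rebuild only the module you are editing, e.g. core
# (-pl limits the Maven reactor to the listed module directories):
mvn install -DskipTests -pl core

# Or, with sbt, run an incremental assembly without `clean`; per Jakob,
# this dropped his assembly time from ~6 minutes to ~2:
sbt/sbt assembly
```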