On 7 Dec 2015, at 19:07, Jakob Odersky <joder...@gmail.com> wrote:

make-distribution and the second code snippet both create a distribution from a 
clean state. They therefore require that every source file be compiled and that 
takes time (you can maybe tweak some settings or use a newer compiler to gain 
some speed).

I'm inferring from your question that for your use case deployment speed is a
critical issue, and that you'd like to build Spark for lots of (every?)
commits in a systematic way. In that case I would suggest you try using the
second code snippet without the `clean` task, and only resort to it if the
build fails.
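
As a rough sketch of that fallback (assuming you are in the Spark source root
and using the same bundled sbt launcher as the snippet quoted below):

```shell
# Try an incremental assembly first; only fall back to a full clean
# build if the incremental one fails.
sbt/sbt assembly || sbt/sbt clean assembly
sbt/sbt publish-local
```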

On my local machine, an assembly without a clean drops from 6 minutes to 2.

regards,
--Jakob

1. You can use zinc, where possible, to speed up Scala compilations.
2. You might also consider setting up a local Jenkins VM, hooked up to
whatever git repo & branch you are working off, and have it do the builds and
tests for you. Not so great for interactive dev, though.
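
The zinc point in item 1 could look roughly like this; it assumes zinc is
installed on your PATH and that the build detects a running zinc server (as
Spark's Maven build does on zinc's default port):

```shell
# Start a long-lived zinc compile server so repeated Scala compilations
# reuse its warm incremental compiler, build, then shut it down.
zinc -start
mvn -DskipTests package
zinc -shutdown
```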

Finally, on the Mac, the "say" command is pretty handy at letting you know
when some work in a terminal is done, so you can do the
first-thing-in-the-morning build of the SNAPSHOTs:

mvn install -DskipTests -Pyarn,hadoop-2.6 -Dhadoop.version=2.7.1; say moo

After that you can work on the modules you care about (via the -pl option).
That doesn't work if you are running on an EC2 instance, though.
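
As an illustrative sketch of the -pl workflow (the module name here is an
example, not something the thread prescribes):

```shell
# After the initial full `mvn install`, rebuild only one module.
# -pl limits the build to the listed module; -am also rebuilds the
# modules it depends on from the same reactor.
mvn install -pl core -am -DskipTests -Pyarn,hadoop-2.6 -Dhadoop.version=2.7.1
```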




On 23 November 2015 at 20:18, Nicholas Chammas
<nicholas.cham...@gmail.com> wrote:

Say I want to build a complete Spark distribution against Hadoop 2.6+ as fast 
as possible from scratch.

This is what I’m doing at the moment:

./make-distribution.sh -T 1C -Phadoop-2.6


-T 1C instructs Maven to spin up 1 thread per available core. This takes around 
20 minutes on an m3.large instance.

I see that spark-ec2, on the other hand, builds Spark as follows when you
deploy Spark at a specific git commit:
<https://github.com/amplab/spark-ec2/blob/a990752575cd8b0ab25731d7820a55c714798ec3/spark/init.sh#L21-L22>

sbt/sbt clean assembly
sbt/sbt publish-local


This seems slower than using make-distribution.sh, actually.

Is there a faster way to do this?

Nick


