Hi,

Is there a good (recommended) way to control and run multiple Spark jobs within the same application? My application works as follows:
1) Run one Spark job on the 'full' dataset, which creates a few thousand RDDs containing sub-datasets of the complete dataset. Each sub-dataset is independent of the others (the 'full' dataset is simply a database dump containing several different types of records).

2) Run some filtering and manipulation on each RDD and finally do some ML on the data. (Each of the RDDs created in step 1 is completely independent, so this step can run concurrently.)

I've implemented this using Scala Futures, executing the Spark jobs in step 2 from a separate thread for each RDD. This works and improves runtime compared to a naive for-loop over step 2. Scaling, however, is not as good as I would expect: 28 minutes on 4 cores on 1 machine -> 19 minutes on 12 cores across 3 machines. Each sub-dataset is fairly small, so in step 1 I've used 'repartition' and 'cache' to keep each sub-dataset on only one machine; this improved runtime by a few percent.

So, does anyone have a suggestion for a better way to do this, or perhaps a higher-level workflow tool that I can use on top of Spark? (The cool solution would have been to use nested RDDs and just map over them in a high-level way, but afaik this is not supported.)

Thanks!
Staffan

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-efficiently-control-concurrent-Spark-jobs-tp21800.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
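For reference, the Futures pattern I'm describing looks roughly like this. This is a simplified sketch, not my actual code: Record, ModelResult, processSubset, and subsets are placeholders standing in for my real types and per-dataset pipeline.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration
import org.apache.spark.rdd.RDD

// Placeholders standing in for the real application:
case class Record(/* ... */)
case class ModelResult(/* ... */)
def processSubset(rdd: RDD[Record]): ModelResult = ??? // filter + manipulate + ML (step 2)
val subsets: Seq[RDD[Record]] = ???                    // the cached sub-datasets from step 1

// A fixed-size pool bounds how many Spark jobs are in flight at once,
// so the driver isn't flooded with thousands of concurrent job submissions.
implicit val ec: ExecutionContext =
  ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(8))

// With spark.scheduler.mode=FAIR set on the SparkConf, jobs submitted from
// different threads share executors instead of queueing strictly FIFO.
val futures: Seq[Future[ModelResult]] =
  subsets.map(rdd => Future { processSubset(rdd) })

val results: Seq[ModelResult] =
  Await.result(Future.sequence(futures), Duration.Inf)
```

The fixed thread pool is the knob I use to trade off concurrent job submissions against scheduler overhead; the FAIR scheduler setting is a standard Spark option for exactly this multi-threaded-driver scenario.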