Re: How to efficiently control concurrent Spark jobs

2015-02-26 Thread Jeffrey Jedele
So basically you have lots of small ML tasks you want to run concurrently?

By "I've used repartition and cache to store the sub-datasets on only one
machine", do you mean that you reduced each RDD to a single partition?

Maybe you want to give the fair scheduler a try to get more of your tasks
executing concurrently. Just an idea...
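
For reference, enabling it looks roughly like this (a minimal, untested sketch;
the app name, allocation file and pool name are just placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch: enable the FAIR scheduler so jobs submitted from different threads
    // share the cluster instead of queueing FIFO.
    val conf = new SparkConf()
      .setAppName("concurrent-ml")                                 // placeholder app name
      .set("spark.scheduler.mode", "FAIR")                         // default is FIFO
      .set("spark.scheduler.allocation.file", "fairscheduler.xml") // optional pool definitions
    val sc = new SparkContext(conf)

    // Each submitting thread can pick a pool via a thread-local property.
    sc.setLocalProperty("spark.scheduler.pool", "mlPool")          // "mlPool" is a made-up pool name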

Regards,
Jeff




How to efficiently control concurrent Spark jobs

2015-02-25 Thread Staffan
Hi,
Is there a good (recommended) way to control and run multiple Spark jobs
within the same application? My application works as follows:

1) Run one Spark job on the 'full' dataset, which then creates a few thousand
RDDs containing sub-datasets of the complete dataset. Each of the
sub-datasets is independent of the others (the 'full' dataset is simply a
dump from a database containing several different types of records). A rough
sketch of this step is included after this list.
2) Run some filtering and manipulation on each of those RDDs and finally do
some ML on the data. (Each of the RDDs created in step 1 is completely
independent, so this should be possible to run concurrently.)
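
Concretely, step 1 looks roughly like this (a simplified sketch; sc, the HDFS
path and the recordTypeOf helper are made up for illustration):

    import org.apache.spark.rdd.RDD

    // Sketch of step 1: split the full database dump into one RDD per record type.
    // The path and the type-extraction logic are placeholders.
    def recordTypeOf(line: String): String = line.split('\t').head

    val fullDataset: RDD[String] = sc.textFile("hdfs:///path/to/full-dump")
    val recordTypes: Array[String] = fullDataset.map(recordTypeOf).distinct().collect()

    // One filtered sub-dataset per record type; each is independent of the others.
    val subDatasets: Map[String, RDD[String]] = recordTypes.map { t =>
      t -> fullDataset.filter(line => recordTypeOf(line) == t)
    }.toMap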

I've implemented this by using Scala Futures and executing the Spark jobs in
2) from a separate thread for each RDD. This works and improves runtime
compared to a naive for-loop over step 2). Scaling is, however, not as good as
I would expect it to be (28 minutes on 4 cores on 1 machine vs. 19 minutes on
12 cores across 3 machines).
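
The concurrency part looks roughly like this (a simplified sketch; the
thread-pool size is arbitrary and trainModel stands in for the real
filtering/ML code):

    import java.util.concurrent.Executors
    import scala.concurrent.{Await, ExecutionContext, Future}
    import scala.concurrent.duration.Duration
    import org.apache.spark.rdd.RDD

    // Sketch of step 2: one Future (and hence one chain of Spark jobs) per sub-dataset.
    // subDatasets comes from the step-1 sketch above.
    implicit val ec: ExecutionContext =
      ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(16))

    def trainModel(rdd: RDD[String]): Double = rdd.count().toDouble  // placeholder for the real ML

    val futures: Seq[Future[(String, Double)]] = subDatasets.toSeq.map { case (t, rdd) =>
      Future {
        val filtered = rdd.filter(_.nonEmpty)  // stand-in for the real filtration/manipulation
        t -> trainModel(filtered)
      }
    }

    // Block until all models are done; each Future triggers its own Spark actions.
    val results: Seq[(String, Double)] = Await.result(Future.sequence(futures), Duration.Inf)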

Each of the sub-datasets is fairly small, so I've used 'repartition' and
'cache' to store each sub-dataset on only one machine in step 1); this
improved runtime by a few percent.
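
That is, something like this (sketch, using subDatasets from the step-1 sketch
above):

    // Collapse each sub-dataset to a single partition and cache it, so the
    // whole sub-dataset ends up on one executor.
    val compactedSubDatasets = subDatasets.map { case (t, rdd) =>
      t -> rdd.repartition(1).cache()
    }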

So, does anyone have a suggestion for how to do this in a better way, or
perhaps a higher-level workflow tool that I can use on top of Spark? (The cool
solution would have been to use nested RDDs and just map over them in a
high-level way, but as far as I know this is not supported.)

Thanks!
Staffan 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-efficiently-control-concurrent-Spark-jobs-tp21800.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org