Re: How to efficiently control concurrent Spark jobs

2015-02-26 Thread Jeffrey Jedele
So basically you have lots of small ML tasks you want to run concurrently?

By "I've used repartition and cache to store the sub-datasets on only one
machine", do you mean that you reduced each RDD to a single partition?

Maybe you want to give the fair scheduler a try to get more of your tasks
executing concurrently. Just an idea...
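
For reference, enabling it looks roughly like this (a minimal, untested sketch;
the app name, allocation file and pool name are just placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch: enable the FAIR scheduler so jobs submitted from different threads
    // share the cluster instead of queueing FIFO.
    val conf = new SparkConf()
      .setAppName("concurrent-ml")                                 // placeholder app name
      .set("spark.scheduler.mode", "FAIR")                         // default is FIFO
      .set("spark.scheduler.allocation.file", "fairscheduler.xml") // optional pool definitions
    val sc = new SparkContext(conf)

    // Each submitting thread can pick a pool via a thread-local property.
    sc.setLocalProperty("spark.scheduler.pool", "mlPool")          // "mlPool" is a made-up pool name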

Regards,
Jeff




How to efficiently control concurrent Spark jobs

2015-02-25 Thread Staffan
Hi,
Is there a good (recommended) way to control and run multiple Spark jobs
within the same application? My application works as follows:

1) Run one Spark job on the 'full' dataset, which then creates a few thousand
RDDs containing sub-datasets of the complete dataset. Each of the
sub-datasets is independent of the others (the 'full' dataset is simply a
dump from a database containing several different types of records). A rough
sketch of this step is included after this list.
2) Run some filtering and manipulation on each of those RDDs and finally do
some ML on the data. (Each of the RDDs created in step 1 is completely
independent, so this should be possible to run concurrently.)
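
Concretely, step 1 looks roughly like this (a simplified sketch; sc, the HDFS
path and the recordTypeOf helper are made up for illustration):

    import org.apache.spark.rdd.RDD

    // Sketch of step 1: split the full database dump into one RDD per record type.
    // The path and the type-extraction logic are placeholders.
    def recordTypeOf(line: String): String = line.split('\t').head

    val fullDataset: RDD[String] = sc.textFile("hdfs:///path/to/full-dump")
    val recordTypes: Array[String] = fullDataset.map(recordTypeOf).distinct().collect()

    // One filtered sub-dataset per record type; each is independent of the others.
    val subDatasets: Map[String, RDD[String]] = recordTypes.map { t =>
      t -> fullDataset.filter(line => recordTypeOf(line) == t)
    }.toMap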

I've implemented this by using Scala Futures and executing the Spark jobs in
2) from a separate thread for each RDD. This works and improves runtime
compared to a naive for-loop over step 2). Scaling is, however, not as good as
I would expect it to be (28 minutes on 4 cores on 1 machine vs. 19 minutes on
12 cores across 3 machines).
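
The concurrency part looks roughly like this (a simplified sketch; the
thread-pool size is arbitrary and trainModel stands in for the real
filtering/ML code):

    import java.util.concurrent.Executors
    import scala.concurrent.{Await, ExecutionContext, Future}
    import scala.concurrent.duration.Duration
    import org.apache.spark.rdd.RDD

    // Sketch of step 2: one Future (and hence one chain of Spark jobs) per sub-dataset.
    // subDatasets comes from the step-1 sketch above.
    implicit val ec: ExecutionContext =
      ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(16))

    def trainModel(rdd: RDD[String]): Double = rdd.count().toDouble  // placeholder for the real ML

    val futures: Seq[Future[(String, Double)]] = subDatasets.toSeq.map { case (t, rdd) =>
      Future {
        val filtered = rdd.filter(_.nonEmpty)  // stand-in for the real filtration/manipulation
        t -> trainModel(filtered)
      }
    }

    // Block until all models are done; each Future triggers its own Spark actions.
    val results: Seq[(String, Double)] = Await.result(Future.sequence(futures), Duration.Inf)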

Each of the sub-datasets is fairly small, so I've used 'repartition' and
'cache' to store each sub-dataset on only one machine in step 1); this
improved runtime by a few percent.
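
That is, something like this (sketch, using subDatasets from the step-1 sketch
above):

    // Collapse each sub-dataset to a single partition and cache it, so the
    // whole sub-dataset ends up on one executor.
    val compactedSubDatasets = subDatasets.map { case (t, rdd) =>
      t -> rdd.repartition(1).cache()
    }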

So, does anyone have a suggestion for how to do this in a better way, or
perhaps a higher-level workflow tool that I can use on top of Spark? (The cool
solution would have been to use nested RDDs and just map over them in a
high-level way, but as far as I know this is not supported.)

Thanks!
Staffan 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-efficiently-control-concurrent-Spark-jobs-tp21800.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org