So basically you have lots of small ML tasks you want to run concurrently?

With "I've used repartition and cache to store the sub-datasets on only one
machine" you mean that you reduced each RDD to have one partition only?

Maybe you want to give the fair scheduler a try, so that more of your tasks
execute concurrently; a rough sketch is below. Just an idea...
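
A minimal sketch, assuming Spark 1.x with Scala; `subDatasets` and `runML`
are placeholders for your own step-1 output and step-2 logic:

    import java.util.concurrent.Executors
    import scala.concurrent.{ExecutionContext, Future}
    import org.apache.spark.{SparkConf, SparkContext}

    // FAIR scheduling round-robins cluster resources across concurrent
    // jobs instead of running them strictly FIFO.
    val conf = new SparkConf()
      .setAppName("concurrent-sub-jobs")
      .set("spark.scheduler.mode", "FAIR")
    val sc = new SparkContext(conf)

    // Bound the number of driver-side threads submitting jobs; with a few
    // thousand sub-datasets an unbounded pool would flood the scheduler.
    implicit val ec = ExecutionContext.fromExecutorService(
      Executors.newFixedThreadPool(16))

    val results = subDatasets.map { rdd =>
      Future {
        // Local properties are per-thread, so set the pool inside the
        // future, on the thread that actually submits the job.
        sc.setLocalProperty("spark.scheduler.pool", "ml")
        runML(rdd) // your filtering + ML for one sub-dataset
      }
    }

You can also define named pools with weights and minShare in an XML file
pointed to by spark.scheduler.allocation.file, if some of the sub-jobs
should get priority over others.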

Regards,
Jeff

2015-02-25 12:06 GMT+01:00 Staffan <staffan.arvids...@gmail.com>:

> Hi,
> Is there a good (recommended) way to control and run multiple Spark jobs
> within the same application? My application works as follows:
>
> 1) Run one Spark job on the 'full' dataset, which then creates a few
> thousand RDDs containing sub-datasets of the complete dataset. Each of the
> sub-datasets is independent of the others (the 'full' dataset is simply a
> dump from a database containing several different types of records).
> 2) Run some filtering and manipulation on each of the RDDs and finally do
> some ML on the data. (Each of the RDDs created in step 1 is completely
> independent, so this should run concurrently.)
>
> I've implemented this using Scala Futures, executing the Spark jobs in
> step 2 from a separate thread for each RDD. This works and improves runtime
> compared to a naive for-loop over step 2. Scaling is, however, not as good
> as I would expect: 28 minutes on 4 cores on 1 machine only drops to 19
> minutes on 12 cores across 3 machines.
>
> Each of the sub-datasets is fairly small, so in step 1 I've used
> 'repartition' and 'cache' to store each sub-dataset on only one machine;
> this improved runtime by a few percent.
>
> So, does anyone have a suggestion for how to do this in a better way, or
> is there perhaps a higher-level workflow tool that I can use on top of
> Spark? (The cool solution would have been to use nested RDDs and just map
> over them in a high-level way, but that is not supported afaik.)
>
> Thanks!
> Staffan
