So basically you have lots of small ML tasks that you want to run concurrently? And by "I've used 'repartition' and 'cache' to store the sub-datasets on only one machine", do you mean that you reduced each RDD to a single partition?
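I.e., something along these lines (just a sketch; 'subDataset' stands in for one of your step-1 RDDs):

    // Shrink the small sub-dataset to a single partition and cache it,
    // so that all of its records end up on one executor.
    val singlePartition = subDataset.repartition(1).cache()

As a side note, coalesce(1) gives you the single partition without the shuffle that repartition(1) implies, which may shave off a bit more time.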
Maybe you want to give the fair scheduler a try to get more of your tasks executing concurrently (rough sketch at the bottom of this mail). Just an idea...

Regards,
Jeff

2015-02-25 12:06 GMT+01:00 Staffan <staffan.arvids...@gmail.com>:
> Hi,
> Is there a good way (a recommended way) to control and run multiple Spark
> jobs within the same application? My application is as follows:
>
> 1) Run one Spark job on the 'full' dataset, which creates a few thousand
> RDDs containing sub-datasets of the complete dataset. Each sub-dataset is
> independent of the others (the 'full' dataset is simply a dump from a
> database containing several different types of records).
> 2) Run some filtering and manipulation on each of the RDDs and finally do
> some ML on the data. (Each of the RDDs created in step 1) is completely
> independent, so this should run concurrently.)
>
> I've implemented this by using Scala Futures and executing the Spark jobs
> in 2) from a separate thread for each RDD. This works and improves runtime
> compared to a naive for-loop over step 2). Scaling is, however, not as
> good as I would expect it to be (28 minutes for 4 cores on 1 machine ->
> 19 minutes for 12 cores on 3 machines).
>
> Each of the sub-datasets is fairly small, so I've used 'repartition' and
> 'cache' to store the sub-datasets on only one machine in step 1); this
> improved runtime by a few percent.
>
> So, does anyone have a suggestion for a better way to do this, or perhaps
> know of a higher-level workflow tool that I can use on top of Spark? (The
> cool solution would have been to use nested RDDs and just map over them in
> a high-level way, but afaik this is not supported.)
>
> Thanks!
> Staffan
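P.S. For concreteness, here is roughly what I mean, combining the fair scheduler with your Futures approach. This is an untested sketch; 'subDatasets', 'runStep2' and the pool name are made-up placeholders for your own step 1) and 2) code:

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.rdd.RDD

    // With FAIR scheduling, jobs submitted from different threads share
    // the cluster round-robin instead of queueing up FIFO.
    val conf = new SparkConf()
      .setAppName("concurrent-ml-jobs")
      .set("spark.scheduler.mode", "FAIR")
    val sc = new SparkContext(conf)

    def runStep2(rdd: RDD[String]): Long = {
      // Pool name is made up; pools not declared in a fairscheduler.xml
      // are created on demand with default settings.
      sc.setLocalProperty("spark.scheduler.pool", "mlPool")
      rdd.count() // stand-in for your filtering + ML pipeline
    }

    // 'subDatasets' would be the few thousand sub-RDDs from your step 1).
    val subDatasets: Seq[RDD[String]] = ???
    val futures = subDatasets.map(rdd => Future(runStep2(rdd)))
    Await.result(Future.sequence(futures), Duration.Inf)

One caveat: the default global execution context gives you roughly one thread per driver core, which caps how many jobs are actually in flight at once; an ExecutionContext backed by a larger fixed thread pool may be worth trying.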