Hi,

Is there a good (recommended) way to control and run multiple Spark jobs within the same application? My application works as follows:
1) Run one Spark job on the 'full' dataset, which creates a few thousand RDDs containing sub-datasets of the complete dataset. Each sub-dataset is independent of the others (the 'full' dataset is simply a database dump containing several different types of records).

2) Run some filtering and manipulation on each RDD and finally do some ML on the data. (Each of the RDDs created in step 1 is completely independent, so this step can run concurrently.)

I've implemented this using Scala Futures, executing the Spark jobs in step 2 from a separate thread for each RDD. This works and improves runtime compared to a naive for-loop over step 2. Scaling, however, is not as good as I would expect: 28 minutes on 4 cores on 1 machine -> 19 minutes on 12 cores across 3 machines. Each sub-dataset is fairly small, so in step 1 I've used 'repartition' and 'cache' to keep each sub-dataset on only one machine; this improved runtime by a few percent.

So, does anyone have a suggestion for a better way to do this, or perhaps a higher-level workflow tool that I can use on top of Spark? (The cool solution would have been to use nested RDDs and just map over them in a high-level way, but afaik this is not supported.)

Thanks!
Staffan

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-efficiently-control-concurrent-Spark-jobs-tp21800.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
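For reference, the Futures pattern I'm describing looks roughly like this. This is a simplified sketch, not my actual code: Record, ModelResult, processSubset, and subsets are placeholders standing in for my real types and per-dataset pipeline.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration
import org.apache.spark.rdd.RDD

// Placeholders standing in for the real application:
case class Record(/* ... */)
case class ModelResult(/* ... */)
def processSubset(rdd: RDD[Record]): ModelResult = ??? // filter + manipulate + ML (step 2)
val subsets: Seq[RDD[Record]] = ???                    // the cached sub-datasets from step 1

// A fixed-size pool bounds how many Spark jobs are in flight at once,
// so the driver isn't flooded with thousands of concurrent job submissions.
implicit val ec: ExecutionContext =
  ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(8))

// With spark.scheduler.mode=FAIR set on the SparkConf, jobs submitted from
// different threads share executors instead of queueing strictly FIFO.
val futures: Seq[Future[ModelResult]] =
  subsets.map(rdd => Future { processSubset(rdd) })

val results: Seq[ModelResult] =
  Await.result(Future.sequence(futures), Duration.Inf)
```

The fixed thread pool is the knob I use to trade off concurrent job submissions against scheduler overhead; the FAIR scheduler setting is a standard Spark option for exactly this multi-threaded-driver scenario.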