Parallelizing job submission makes sense when the jobs have relatively few tasks each and you have a lot of execution slots available. In that case it pays to turn them all loose at once and use the parallelism the cluster offers.
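As a minimal sketch of what "turning jobs loose at once" can look like (this code is not from the thread; the app name, dataset, and local master are illustrative assumptions), two actions on a shared cached RDD can be submitted from separate threads via Scala Futures, letting the scheduler dispatch tasks from both jobs concurrently:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global
import org.apache.spark.sql.SparkSession

object ConcurrentJobsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("concurrent-jobs-sketch")
      .master("local[*]") // assumption: local mode, for illustration only
      .getOrCreate()
    val sc = spark.sparkContext

    val rdd = sc.parallelize(1 to 1000000).cache()

    // Each Future triggers an independent Spark job. SparkContext is
    // thread-safe for job submission, so both jobs can be scheduled at
    // the same time if free executor slots exist.
    val sumF   = Future(rdd.map(_.toLong).reduce(_ + _))
    val countF = Future(rdd.filter(_ % 2 == 0).count())

    val sum   = Await.result(sumF, Duration.Inf)
    val count = Await.result(countF, Duration.Inf)
    println(s"sum=$sum count=$count")

    spark.stop()
  }
}
```

Note that if both jobs start before the first one has materialized the cache, each may recompute some partitions itself, which is exactly the downside described below.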
There are downsides, eventually. For example, N jobs accessing one cached RDD may recompute the RDD's partitions many times, since the cached copy may not yet be available when many of them start. At some point, heavily oversubscribing your cluster with a backlog of tasks is bad. And it can be a net loss if a bunch of tasks that all access the same data try to schedule at the same time, since only some of them can be local to that data.

On Fri, Jan 15, 2016 at 8:11 PM, Jakob Odersky <joder...@gmail.com> wrote:
> I stand corrected. How considerable are the benefits though? Will the
> scheduler be able to dispatch jobs from both actions simultaneously (or on a
> when-workers-become-available basis)?