It makes sense if you're parallelizing jobs that each have relatively
few tasks and you have a lot of execution slots available: turning them
all loose at once lets you use the parallelism that's there.

There are downsides, eventually: for example, N jobs that all access
one cached RDD may each recompute the RDD's partitions, since the
cached copy may not yet exist when many of them start. At some point,
heavily oversubscribing your cluster with a backlog of tasks is bad.
And you might find it's a net loss when a bunch of tasks that all
access the same data try to schedule at the same time, since only some
of them can be local to that data.

On Fri, Jan 15, 2016 at 8:11 PM, Jakob Odersky <joder...@gmail.com> wrote:
> I stand corrected. How considerable are the benefits though? Will the
> scheduler be able to dispatch jobs from both actions simultaneously (or on a
> when-workers-become-available basis)?

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org