While "apparently" saturating the N available workers using your proposed N
partitions - the "actual" distribution of workers to tasks is controlled by
the scheduler.  If my past experience were of service - you can *not *trust
the default Fair Scheduler to ensure the round-robin scheduling of the
tasks: you may well end up with tasks being queued.
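
To make the experiment meaningful, first pin down the resource side so
that N task slots actually exist. A rough sketch in Scala -- the app
name, the value of n, and the one-core-per-executor layout are my
assumptions for illustration, not anything from your setup:

import org.apache.spark.{SparkConf, SparkContext}

val n = 8 // assumed partition/worker count for illustration

val conf = new SparkConf()
  .setAppName("gang-schedule-test") // hypothetical app name
  // Ask YARN for exactly n executors with one core each, so exactly
  // n task slots exist. Whether n tasks actually start together is
  // still the scheduler's call, which is the point above.
  .set("spark.executor.instances", n.toString)
  .set("spark.executor.cores", "1")
  // Dynamic allocation can release executors mid-job, and speculation
  // launches duplicate tasks; both break a fixed one-task-per-slot setup.
  .set("spark.dynamicAllocation.enabled", "false")
  .set("spark.speculation", "false")

val sc = new SparkContext(conf)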

My suggestion is to try it out on the resource manager and scheduler used
in your deployment. You may need to swap out the default scheduler for a
true round robin.

2016-11-19 16:44 GMT-08:00 Adam Smith <adamsmith8...@gmail.com>:

> Dear community,
>
> I have an RDD with N rows and N partitions. I want to ensure that the
> partitions all run at the same time, by setting the number of vcores
> (spark-yarn) to N. The partitions need to talk to each other with some
> socket-based sync, which is why I need them to run more or less
> simultaneously.
>
> Let's assume no node will die. Will my setup guarantee that all partitions
> are computed in parallel?
>
> I know this is somewhat hackish. Is there a better way of doing so?
>
> My goal is to replicate message passing (as in OpenMPI) with Spark, where I
> have very specific and final communication requirements. So there is no need
> for the full comm and sync functionality, just what I already have: sync and
> talk.
>
> Thanks!
> Adam
>
>

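For what it's worth, the usual way to bootstrap this kind of
per-partition sync-and-talk is an explicit rendezvous step, since tasks
do not know each other's hosts up front. A rough sketch, again assuming
sc and n from above; the coordinator address is hypothetical, and the
coordinator itself (a tiny server that collects n registrations and then
sends every task the full peer list) is left out:

import java.io.{BufferedReader, InputStreamReader, PrintWriter}
import java.net.{InetAddress, ServerSocket, Socket}

val coordHost = "driver-host.example.com" // assumption: rendezvous host
val coordPort = 9999                      // assumption: rendezvous port

sc.parallelize(0 until n, n).mapPartitionsWithIndex { (pid, _) =>
  // Each task binds an ephemeral port for peer traffic, then registers
  // "pid host port" with the coordinator. The coordinator answers with
  // the full peer list only once all n tasks have checked in, so this
  // read doubles as a startup barrier: if the scheduler queued a task,
  // everyone blocks here instead of proceeding with a partial world.
  val server = new ServerSocket(0)
  val coord  = new Socket(coordHost, coordPort)
  val out    = new PrintWriter(coord.getOutputStream, true)
  val in     = new BufferedReader(new InputStreamReader(coord.getInputStream))
  out.println(s"$pid ${InetAddress.getLocalHost.getHostName} ${server.getLocalPort}")
  val peers = in.readLine() // blocks until all n have registered
  // ... the actual sync-and-talk against `peers` goes here ...
  in.close(); out.close(); coord.close(); server.close()
  Iterator.single(pid)
}.collect()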