From the logs, it seems that your tasks are being started in parallel. If they were being executed serially, you would have seen the following in the logs:
Starting task 1
Finished task 1
Starting task 2
Finished task 2
Starting task 3
Finished task 3
...

Instead you are seeing

Starting task 1
Starting task 2
...

and, most probably,

Finished task 1
Finished task 2
...

That shows that they are starting on multiple machines and running concurrently.

Regarding the other problem of making sure exactly one command goes to each worker: that is not ensured by default, as any partition can be executed on any machine (since there is no locality preference set for each partition/task). I can think of the following ways to solve it.

1. Set location preferences! If you know the names of all the worker machines, you can create the RDD with a version of parallelize that lets you set the preferred location of each partition. That will ensure deterministic behavior, sending each partition to the corresponding worker (assuming speculative execution and delay scheduling are turned off). See the sketch in the P.S. below.

2. To figure out the names of the worker nodes without hardcoding them, you could run a dummy Spark job with many, many partitions, which will return the hostnames of all the workers. Note that this will try to ensure (but not guarantee) that at least one partition is scheduled on each active worker. In fact, if there are other jobs running in the system, these dummy tasks will probably not get scheduled on all the workers. It is hard to get around this without some outside mechanism that knows all the workers in the cluster. See the sketch in the P.P.S. below.

Hope this helped.

TD

On 5/12/14, NevinLi158 <nevinli...@gmail.com> wrote:
> I can't seem to get Spark to run the tasks in parallel. My Spark code is
> the following:
>
> //Create commands to be piped into a C++ program
> List<String> commandList =
>     makeCommandList(Integer.parseInt(step.first()), 100);
>
> JavaRDD<String> commandListRDD = ctx.parallelize(commandList,
>     commandList.size());
>
> //Run the C++ application
> JavaRDD<String> workerOutput = commandListRDD.pipe("RandomC++Application");
> workerOutput.saveAsTextFile("output");
>
> Running this code appears to make the system run all the tasks in series
> as opposed to in parallel: any ideas as to what could be wrong? I'm
> guessing that there is an issue with the serializer, due to the sample
> output below:
>
> 14/05/12 17:17:32 INFO TaskSchedulerImpl: Adding task set 1.0 with 14 tasks
> 14/05/12 17:17:32 INFO TaskSetManager: Starting task 1.0:0 as TID 0 on
> executor 2: neuro-1-3.local (PROCESS_LOCAL)
> 14/05/12 17:17:32 INFO TaskSetManager: Serialized task 1.0:0 as 4888 bytes
> in 9 ms
> 14/05/12 17:17:32 INFO TaskSetManager: Starting task 1.0:1 as TID 1 on
> executor 5: neuro-2-0.local (PROCESS_LOCAL)
> 14/05/12 17:17:32 INFO TaskSetManager: Serialized task 1.0:1 as 4890 bytes
> in 1 ms
> 14/05/12 17:17:32 INFO TaskSetManager: Starting task 1.0:2 as TID 2 on
> executor 12: neuro-1-4.local (PROCESS_LOCAL)
> 14/05/12 17:17:32 INFO TaskSetManager: Serialized task 1.0:2 as 4890 bytes
> in 1 ms
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Forcing-spark-to-send-exactly-one-element-to-each-worker-node-tp5605p5616.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
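
P.S. For point 1, a minimal sketch of the location-preference approach. The variant of parallelize that accepts per-partition preferred locations is exposed in the Scala API as SparkContext.makeRDD (from Java you can reach the underlying SparkContext via JavaSparkContext.sc()). The hostnames below are taken from your logs; the command strings and app name are made-up placeholders:

import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: makeRDD takes (element, preferred hostnames) pairs and
// creates one partition per element, preferring the listed host for it.
val sc = new SparkContext(new SparkConf().setAppName("PinnedCommands"))

val commandsWithLocations: Seq[(String, Seq[String])] = Seq(
  ("command-for-worker-1", Seq("neuro-1-3.local")),
  ("command-for-worker-2", Seq("neuro-2-0.local")),
  ("command-for-worker-3", Seq("neuro-1-4.local"))
)

sc.makeRDD(commandsWithLocations)
  .pipe("RandomC++Application")
  .saveAsTextFile("output")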
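
P.P.S. For point 2, a sketch of the dummy job that discovers worker hostnames. The 1000 partitions is an arbitrary "many more partitions than workers" number:

import java.net.InetAddress

// Sketch only: run many tiny tasks and record the hostname each lands on.
// With far more partitions than workers, at least one task will usually
// (but not provably) run on every active worker.
val workerHostnames = sc
  .parallelize(1 to 1000, numSlices = 1000)
  .mapPartitions(_ => Iterator(InetAddress.getLocalHost.getHostName))
  .distinct()
  .collect()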