From the logs, it seems that your tasks are being started in parallel. If they were being executed serially, you would have seen the following in the logs:
Starting task 1
Finished task 1
Starting task 2
Finished task 2
Starting task 3
Finished task 3
...

Instead you are seeing

Starting task 1
Starting task 2
...

and, most probably,

Finished task 1
Finished task 2
...

That shows that they are starting on multiple machines and running concurrently.

Regarding the other problem of making sure exactly one command goes to each worker: that is not ensured by default, as any partition can be executed on any machine (since there is no locality preference set for each partition/task). I can think of the following ways to solve it.

1. Set location preferences! If you know the names of all the worker machines, you can create the RDD with a version of parallelize that lets you set the preferred location of each partition. That will ensure deterministic behavior, sending each partition to the corresponding worker (assuming speculative execution and delay scheduling are turned off). See the sketch in the P.S. below.

2. To figure out the names of the worker nodes without hardcoding them, you could run a dummy Spark job with many, many partitions, which will return the hostnames of all the workers. Note that this will try to ensure (but not guarantee) that at least one partition is scheduled on each active worker. In fact, if there are other jobs running in the system, these dummy tasks will probably not get scheduled on all the workers. It is hard to get around this without some outside mechanism that knows all the workers in the cluster. See the sketch in the P.P.S. below.

Hope this helped.

TD

On 5/12/14, NevinLi158 <nevinli...@gmail.com> wrote:
> I can't seem to get Spark to run the tasks in parallel. My Spark code is
> the following:
>
> //Create commands to be piped into a C++ program
> List<String> commandList =
>     makeCommandList(Integer.parseInt(step.first()), 100);
>
> JavaRDD<String> commandListRDD = ctx.parallelize(commandList,
>     commandList.size());
>
> //Run the C++ application
> JavaRDD<String> workerOutput = commandListRDD.pipe("RandomC++Application");
> workerOutput.saveAsTextFile("output");
>
> Running this code appears to make the system run all the tasks in series
> as opposed to in parallel: any ideas as to what could be wrong? I'm
> guessing that there is an issue with the serializer, due to the sample
> output below:
>
> 14/05/12 17:17:32 INFO TaskSchedulerImpl: Adding task set 1.0 with 14 tasks
> 14/05/12 17:17:32 INFO TaskSetManager: Starting task 1.0:0 as TID 0 on
> executor 2: neuro-1-3.local (PROCESS_LOCAL)
> 14/05/12 17:17:32 INFO TaskSetManager: Serialized task 1.0:0 as 4888 bytes
> in 9 ms
> 14/05/12 17:17:32 INFO TaskSetManager: Starting task 1.0:1 as TID 1 on
> executor 5: neuro-2-0.local (PROCESS_LOCAL)
> 14/05/12 17:17:32 INFO TaskSetManager: Serialized task 1.0:1 as 4890 bytes
> in 1 ms
> 14/05/12 17:17:32 INFO TaskSetManager: Starting task 1.0:2 as TID 2 on
> executor 12: neuro-1-4.local (PROCESS_LOCAL)
> 14/05/12 17:17:32 INFO TaskSetManager: Serialized task 1.0:2 as 4890 bytes
> in 1 ms
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Forcing-spark-to-send-exactly-one-element-to-each-worker-node-tp5605p5616.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
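
P.S. For point 1, a minimal sketch of the location-preference approach. The variant of parallelize that accepts per-partition preferred locations is exposed in the Scala API as SparkContext.makeRDD (from Java you can reach the underlying SparkContext via JavaSparkContext.sc()). The hostnames below are taken from your logs; the command strings and app name are made-up placeholders:

import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: makeRDD takes (element, preferred hostnames) pairs and
// creates one partition per element, preferring the listed host for it.
val sc = new SparkContext(new SparkConf().setAppName("PinnedCommands"))

val commandsWithLocations: Seq[(String, Seq[String])] = Seq(
  ("command-for-worker-1", Seq("neuro-1-3.local")),
  ("command-for-worker-2", Seq("neuro-2-0.local")),
  ("command-for-worker-3", Seq("neuro-1-4.local"))
)

sc.makeRDD(commandsWithLocations)
  .pipe("RandomC++Application")
  .saveAsTextFile("output")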
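
P.P.S. For point 2, a sketch of the dummy job that discovers worker hostnames. The 1000 partitions is an arbitrary "many more partitions than workers" number:

import java.net.InetAddress

// Sketch only: run many tiny tasks and record the hostname each lands on.
// With far more partitions than workers, at least one task will usually
// (but not provably) run on every active worker.
val workerHostnames = sc
  .parallelize(1 to 1000, numSlices = 1000)
  .mapPartitions(_ => Iterator(InetAddress.getLocalHost.getHostName))
  .distinct()
  .collect()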