Most of the information you're asking for can be found on the Spark web UI (see <http://spark.apache.org/docs/1.1.0/monitoring.html>). You can see which tasks are being processed by which nodes.

If you're using HDFS and your file size is smaller than the HDFS block size, you will only have one partition (remember, there is exactly one task for each partition in a stage). If you want to force more partitions, you can call RDD.repartition(numPartitions). Note that this introduces a shuffle you wouldn't otherwise have; a quick sketch is below.
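For example, something along these lines (a minimal Spark 1.x Scala sketch; the input/output paths and the target of 8 partitions are illustrative, not from your job):

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCount {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("WordCount"))
        val lines = sc.textFile("hdfs:///path/to/input.txt")

        // A small file under one HDFS block typically yields a single partition.
        println("partitions before: " + lines.partitions.length)

        // Force a wider split; note this repartition introduces a shuffle.
        val spread = lines.repartition(8)
        println("partitions after: " + spread.partitions.length)

        val counts = spread
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)
        counts.saveAsTextFile("hdfs:///path/to/output")
        sc.stop()
      }
    }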
Also make sure your job is allocated more than one core in your cluster (you can see this on the web UI).

On Fri, Nov 14, 2014 at 2:18 PM, Steve Lewis <lordjoe2...@gmail.com> wrote:

> I have instrumented word count to track how many machines the code runs
> on. I use an accumulator to maintain a Set of MacAddresses. I find that
> everything is done on a single machine. This is probably optimal for word
> count but not for the larger problems I am working on.
> How do I force processing to be split into multiple tasks? How do I access
> the task and attempt numbers to track which processing happens in which
> attempt? Also, is using MacAddress a good way to determine which machine is
> running the code?
> As far as I can tell, a simple word count is running in one thread on one
> machine and the remainder of the cluster does nothing.
> This is consistent with tests where I write to stdout from functions and
> see little output on most machines in the cluster.

--
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

54 W 40th St, New York, NY 10018
E: daniel.siegm...@velos.io W: www.velos.io