Most of the information you're asking for can be found on the Spark web UI (see
here <http://spark.apache.org/docs/1.1.0/monitoring.html>). You can see
which tasks are being processed by which nodes.

If you're using HDFS and your file is smaller than the HDFS block size,
you will only have one partition (remember, there is exactly one task per
partition in a stage). If you want to force it to have more partitions,
you can call RDD.repartition(numPartitions). Note that this will introduce
a shuffle you wouldn't otherwise have.
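For illustration, here is a minimal word-count sketch along those lines (the
input/output paths, app name, and partition count of 8 are just placeholders):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._  // pair-RDD functions (needed on Spark 1.x)

object RepartitionedWordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("RepartitionedWordCount"))

    // A text file smaller than one HDFS block arrives as a single partition.
    val lines = sc.textFile("hdfs:///tmp/small-input.txt")
    println("partitions before: " + lines.partitions.length)

    // Force more partitions; this shuffles the data but lets more tasks run in parallel.
    val spread = lines.repartition(8)
    println("partitions after: " + spread.partitions.length)

    val counts = spread
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.saveAsTextFile("hdfs:///tmp/word-counts")
    sc.stop()
  }
}

If your input format is splittable (plain text usually is), passing a
minPartitions hint, e.g. sc.textFile(path, 8), can also get you more
partitions without the extra shuffle.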

Also make sure your job is allocated more than one core in your cluster
(you can see this on the web UI).
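If it helps, here is a rough sketch of how a core cap can be set from code.
This assumes a standalone (or Mesos) deployment, where "spark.cores.max"
limits the total cores the application may take; on YARN you would size
executors via spark-submit's --num-executors / --executor-cores instead. The
value 8 is arbitrary:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("WordCount")
  .set("spark.cores.max", "8")  // cap on total cores across the cluster
val sc = new SparkContext(conf)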

On Fri, Nov 14, 2014 at 2:18 PM, Steve Lewis <lordjoe2...@gmail.com> wrote:

>  I have instrumented word count to track how many machines the code runs
> on. I use an accumulator to maintain a Set of MacAddresses. I find that
> everything is done on a single machine. This is probably optimal for word
> count, but not for the larger problems I am working on.
> How do I force processing to be split into multiple tasks? How do I access
> the task and attempt numbers to track which processing happens in which
> attempt? Also, I am using MacAddress to determine which machine is running
> the code.
> As far as I can tell, a simple word count is running in one thread on one
> machine and the remainder of the cluster does nothing.
> This is consistent with tests where I write to stdout from functions and
> see little output on most machines in the cluster.
>
>



-- 
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

54 W 40th St, New York, NY 10018
E: daniel.siegm...@velos.io W: www.velos.io
