I think coalesce with shuffle=true will force it to have one task per node.
Without that, data locality may lead it to schedule multiple tasks on the same
node even though the total number of tasks equals the number of nodes.
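
A minimal sketch of what that looks like (assuming an existing SparkContext named sc and a
numNodes value matching your cluster; the names here are just placeholders):

    val numNodes = 4                         // assumed number of worker nodes
    val data = sc.parallelize(1 to 1000)     // stand-in for the real input RDD
    // shuffle = true repartitions the data (via the hash partitioner) instead of
    // merely merging co-located partitions, so the resulting partitions -- and
    // their tasks -- should spread across the cluster rather than pile up on one machine.
    val oneTaskPerNode = data.coalesce(numNodes, shuffle = true)
    oneTaskPerNode.foreachPartition { part =>
      // placeholder for the heavyweight per-node work (e.g. shelling out to the C++ program)
      println(s"processing ${part.size} records on this node")
    }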

If this is the *only* thing you run on the cluster, you could also configure 
the Workers to report only one core each, by launching the 
org.apache.spark.deploy.worker.Worker process manually with the --cores flag (see 
http://spark.apache.org/docs/latest/spark-standalone.html).
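
Concretely, that would look something like the following (a sketch only; the master
URL is a placeholder, and SPARK_WORKER_CORES is the equivalent spark-env.sh setting
documented on that page):

    # Option A: in conf/spark-env.sh on each machine, before using the launch scripts
    export SPARK_WORKER_CORES=1

    # Option B: launch the Worker by hand with the --cores flag
    ./bin/spark-class org.apache.spark.deploy.worker.Worker --cores 1 spark://MASTER_HOST:7077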

Matei

On Jul 14, 2014, at 1:59 PM, Daniel Siegmann <daniel.siegm...@velos.io> wrote:

> I don't have a solution for you (sorry), but do note that 
> rdd.coalesce(numNodes) keeps the data on the same nodes it is already on. If 
> you set shuffle=true then it should repartition and redistribute the data. 
> But according to the Scaladoc it uses the hash partitioner, and I don't know 
> of any way to supply a custom partitioner.
> 
> 
> On Mon, Jul 14, 2014 at 4:09 PM, Ravi Pandya <r...@iecommerce.com> wrote:
> I'm trying to run a job that includes an invocation of a memory- and 
> compute-intensive multithreaded C++ program, so I'd like to run one task per 
> physical node. Using rdd.coalesce(# nodes) seems to just allocate one task 
> per core, and so the node runs out of memory. Is there any way to give the 
> scheduler a hint that the task uses a lot of memory and cores so it spreads 
> the tasks out more evenly?
> 
> Thanks,
> 
> Ravi Pandya
> Microsoft Research
> 
> 
> 
> -- 
> Daniel Siegmann, Software Developer
> Velos
> Accelerating Machine Learning
> 
> 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
> E: daniel.siegm...@velos.io W: www.velos.io
