Re: Memory compute-intensive tasks

2014-08-04 Thread rpandya
producing only a little data. Ravi

Re: Memory compute-intensive tasks

2014-07-29 Thread rpandya
on? Thanks, Ravi

Re: Memory compute-intensive tasks

2014-07-18 Thread rpandya
Hi Matei - Changing to coalesce(numNodes, true) still runs all partitions on a single node, which I verified by printing the hostname before I exec the external process.
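
A minimal sketch of that hostname check, written against the Spark 1.x Scala API; the RDD name `data` and the node count `numNodes` are placeholders, not names from the thread:

    import java.net.InetAddress

    // Print the executor hostname once per partition to see where each
    // task actually lands after the coalesce.
    val placements = data
      .coalesce(numNodes, shuffle = true)
      .mapPartitionsWithIndex { (idx, records) =>
        val host = InetAddress.getLocalHost.getHostName
        println(s"partition $idx on $host")
        Iterator((idx, host, records.size))
      }
      .collect()

    placements.foreach { case (idx, host, count) =>
      println(s"partition $idx -> $host ($count records)")
    }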

Re: Memory compute-intensive tasks

2014-07-18 Thread rpandya

Re: Memory compute-intensive tasks

2014-07-16 Thread rpandya

Re: Memory compute-intensive tasks

2014-07-16 Thread Liquan Pei
(DFSInputStream.java:619) -- Liquan Pei, Department of Physics, University

Memory compute-intensive tasks

2014-07-14 Thread Ravi Pandya
I'm trying to run a job that includes an invocation of a memory- and compute-intensive multithreaded C++ program, and so I'd like to run one task per physical node. Using rdd.coalesce(# nodes) seems to just allocate one task per core, and so runs out of memory on the node. Is there any way to give the
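
One common way to hand each partition to an external binary is RDD.pipe, which streams a partition's records through the program's stdin and stdout. A rough sketch only; the input path, node count, binary path, and output path are all placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("external-tool"))

    val numNodes = 8                                  // placeholder node count
    val input = sc.textFile("hdfs:///data/input")     // placeholder input

    // Reduce to roughly one partition per node, then stream each
    // partition's lines through the external C++ program.
    val results = input
      .coalesce(numNodes, shuffle = true)
      .pipe("/opt/tools/heavy_tool")                  // placeholder binary path

    results.saveAsTextFile("hdfs:///data/output")     // placeholder output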

Re: Memory compute-intensive tasks

2014-07-14 Thread Daniel Siegmann
I don't have a solution for you (sorry), but do note that rdd.coalesce(numNodes) keeps data on the same nodes where it was. If you set shuffle=true then it should repartition and redistribute the data. But it uses the hash partitioner according to the ScalaDoc - I don't know of any way to supply a

Re: Memory compute-intensive tasks

2014-07-14 Thread Matei Zaharia
I think coalesce with shuffle=true will force it to have one task per node. Without that, it might be that due to data locality it decides to launch multiple ones on the same node even though the total # of tasks is equal to the # of nodes. If this is the *only* thing you run on the cluster,
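
For reference, the difference the two replies above describe, as it appears at the call site (`rdd` and `numNodes` are placeholders):

    // No shuffle: existing partitions are merged locally, so placement
    // follows data locality and several merged partitions can end up
    // on the same node.
    val merged = rdd.coalesce(numNodes)

    // shuffle = true: a full shuffle redistributes the records across
    // exactly numNodes partitions (hash-partitioned, per the ScalaDoc)
    // instead of leaving the data where it already lives.
    val spread = rdd.coalesce(numNodes, shuffle = true)

    // repartition(n) is shorthand for coalesce(n, shuffle = true).
    val same = rdd.repartition(numNodes)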

Re: Memory compute-intensive tasks

2014-07-14 Thread Daniel Siegmann
Depending on how your C++ program is designed, maybe you can feed the data from multiple partitions into the same process? Getting the results back might be tricky. But that may be the only way to guarantee you're only using one invocation per node.
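
A sketch of that idea: launch one external process per partition (which amounts to one per node once the RDD has been coalesced to one partition per node), feed it the partition's records on stdin from a background thread, and return its stdout lines as the partition's output. The binary path, the line-per-record format, and the `input`/`numNodes` names are assumptions, and error handling is omitted:

    import java.io.{BufferedReader, InputStreamReader, PrintWriter}

    val results = input.coalesce(numNodes, shuffle = true).mapPartitions { records =>
      // One external process per partition.
      val proc = new ProcessBuilder("/opt/tools/heavy_tool").start()

      // Feed records to the process on a separate thread so stdin and
      // stdout are drained concurrently and neither side blocks.
      new Thread {
        override def run(): Unit = {
          val out = new PrintWriter(proc.getOutputStream)
          records.foreach(line => out.println(line))
          out.close()
        }
      }.start()

      // The process's stdout lines become this partition's output.
      val reader = new BufferedReader(new InputStreamReader(proc.getInputStream))
      Iterator.continually(reader.readLine()).takeWhile(_ != null)
    }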