producing only a little data.
Ravi
on?
Thanks,
Ravi
Hi Matei-
Changing to coalesce(numNodes, true) still runs all partitions on a single
node, which I verified by printing the hostname before I exec the external
process.
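For reference, a minimal sketch of that kind of check (the RDD name, node count, and binary path here are placeholders I'm assuming, not from the thread):

    import java.net.InetAddress

    val hostsUsed = rdd.coalesce(numNodes, shuffle = true)
      .mapPartitions { iter =>
        // Log which executor host this partition landed on before exec'ing the binary.
        val host = InetAddress.getLocalHost.getHostName
        println("partition running on " + host)   // shows up in the executor's stdout log
        // ... exec the external C++ process on `iter` here ...
        Iterator(host)
      }
      .collect()

    println(hostsUsed.distinct.mkString(", "))     // distinct hosts that actually ran work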
--
Liquan Pei
Department of Physics
University
I'm trying to run a job that includes an invocation of a memory- and
compute-intensive multithreaded C++ program, and so I'd like to run one
task per physical node. Using rdd.coalesce(# nodes) seems to just allocate
one task per core, and so runs out of memory on the node. Is there any way
to give the
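A rough sketch of the setup being described (the node count, paths, and binary name are placeholders I'm assuming, not from the message):

    val numNodes = 8                                  // assumed cluster size
    val input = sc.textFile("hdfs:///data/input")     // assumed input

    val results = input
      .coalesce(numNodes)                             // intent: one partition (and task) per node
      .pipe("/opt/tools/heavy_cpp_program")           // the multithreaded C++ program, line in / line out
    results.saveAsTextFile("hdfs:///data/output")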
I don't have a solution for you (sorry), but do note that
rdd.coalesce(numNodes) keeps data on the same nodes where it was. If you
set shuffle=true then it should repartition and redistribute the data. But
it uses the hash partitioner according to the ScalaDoc - I don't know of
any way to supply a custom partitioner.
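In code, the two variants look like this (illustrative only; `rdd` and `numNodes` are assumed):

    val keepLocal  = rdd.coalesce(numNodes)                  // narrow dependency: merges partitions in place, no data movement
    val reshuffled = rdd.coalesce(numNodes, shuffle = true)  // full shuffle; equivalent to rdd.repartition(numNodes)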
I think coalesce with shuffle=true will force it to have one task per node.
Without that, data locality might lead it to launch multiple tasks on the same
node even though the total # of tasks is equal to the # of nodes.
If this is the *only* thing you run on the cluster,
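One knob sometimes used when the job has the whole cluster to itself (not from this thread; assuming one executor per node with 16 cores, and the exact config keys vary by deploy mode and Spark version) is to make each task claim all of an executor's cores:

    val conf = new org.apache.spark.SparkConf()
      .setAppName("one-task-per-node")
      .set("spark.executor.cores", "16")   // assumed: all cores of the node go to one executor
      .set("spark.task.cpus", "16")        // each task needs all 16 cores, so at most one task fits per executor
    val sc = new org.apache.spark.SparkContext(conf)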
Depending on how your C++ program is designed, maybe you can feed the data
from multiple partitions into the same process? Getting the results back
might be tricky. But that may be the only way to guarantee you're only
using one invocation per node.
On Mon, Jul 14, 2014 at 5:12 PM, Matei Zaharia
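A rough illustration of that idea, not from the thread: keep a lazily-started process per executor JVM so every partition on that node feeds the same invocation (the binary path and RDD are placeholders; result handling and synchronization between concurrently running tasks are omitted):

    // One external process per executor JVM, shared by all partitions that run there.
    object ExternalProc {
      lazy val proc: Process = new ProcessBuilder("/opt/tools/heavy_cpp_program").start()
    }

    rdd.foreachPartition { iter =>
      val out = ExternalProc.proc.getOutputStream
      iter.foreach(line => out.write((line + "\n").getBytes("UTF-8")))
      out.flush()
      // Reading results back from the process is the tricky part mentioned above.
    }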