A few more data points: my current theory is now that Spark's piping
mechanism is considerably slower than just running the C++ app directly on
the node.
I ran the C++ application directly on a node in the cluster and timed the
execution of various parts of the program; the full run took ~10 seconds.
I can't seem to get Spark to run the tasks in parallel. My Spark code is the
following:
//Create commands to be piped into a C++ program
List<String> commandList =
    makeCommandList(Integer.parseInt(step.first()), 100);
JavaRDD<String> commandListRDD =
    ctx.parallelize(commandList, commandList.size());
//Run the C++ program by piping each command into it
Fixed the problem as soon as I sent this out, sigh. Apparently you can do
this by changing the number of slices to cut the dataset into: I had thought
that was identical to the number of partitions, but apparently not.
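For the archives, here's a minimal standalone sketch of what I mean (the
master URL, app name, and command strings are placeholders): the second
argument to parallelize is the number of slices, and you can check what
Spark actually did with partitions() and glom().

import java.util.Arrays;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SliceCheck {
    public static void main(String[] args) {
        JavaSparkContext ctx = new JavaSparkContext("local[4]", "SliceCheck");
        List<String> commands = Arrays.asList("cmd1", "cmd2", "cmd3", "cmd4");
        // Ask for one slice per element, so each task should get exactly one command
        JavaRDD<String> rdd = ctx.parallelize(commands, commands.size());
        // How many partitions did the RDD actually end up with?
        System.out.println("partitions: " + rdd.partitions().size());
        // glom() turns each partition into a list, so this shows what landed where
        System.out.println(rdd.glom().collect());
        ctx.stop();
    }
}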
Hi all,
I'm currently trying to use pipe to run C++ code on each worker node, and I
have an RDD of essentially command-line arguments that I'm passing to each
node. I want to send exactly one element to each node, but when I run my
code, Spark ends up sending multiple elements to a node: is there a way to
force Spark to send exactly one element to each node?
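For reference, a stripped-down sketch of the pattern I'm using (the binary
path, master URL, and argument strings are made up): pipe() launches the
external program once per partition, writes the partition's elements to its
stdin one per line, and returns the lines the program prints to stdout as
the elements of the result RDD.

import java.util.Arrays;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class PipeSketch {
    public static void main(String[] args) {
        JavaSparkContext ctx = new JavaSparkContext("local[4]", "PipeSketch");
        // One element per intended task: each string is one command line for the C++ program
        List<String> arguments = Arrays.asList("args-for-run-1", "args-for-run-2");
        JavaRDD<String> argsRDD = ctx.parallelize(arguments, arguments.size());
        // pipe() runs the program once per partition and feeds it that
        // partition's elements on stdin, one per line
        JavaRDD<String> results = argsRDD.pipe("/path/to/cpp_program");
        for (String line : results.collect()) {
            System.out.println(line);
        }
        ctx.stop();
    }
}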