I am working on a Spark application that needs to run a function once on each node in the cluster. The function reads data from a directory that is not globally accessible to the cluster. I have tried creating an RDD with n elements and n partitions, so that it would be evenly distributed among the n nodes, and then mapping a function over the RDD. However, the runtime makes no guarantee that each partition will be placed on a separate node. This means that the code can run multiple times on one node while never running on another.
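The attempt described above can be sketched roughly as follows in PySpark. This is only an illustration, not the original code: the names `local_host`, `hosts_hit` and `summarize` are hypothetical, an existing `SparkContext` is assumed to be passed in as `sc`, and returning the hostname stands in for reading the node-local directory.

```python
import socket
from collections import Counter


def local_host(_partition):
    # Runs on whichever executor Spark schedules this partition on.
    # In the real job, the node-local directory would be read here;
    # the hostname is returned only to show where the task ran.
    yield socket.gethostname()


def hosts_hit(sc, n):
    # n elements in n partitions, so Spark launches n tasks.
    # `sc` is assumed to be an existing SparkContext.
    rdd = sc.parallelize(range(n), n)
    return rdd.mapPartitions(local_host).collect()


def summarize(hostnames):
    # Count how many tasks landed on each host.
    return Counter(hostnames)
```

Calling `summarize(hosts_hit(sc, n))` on the cluster shows the problem: some hostnames can appear more than once while others are missing entirely, because nothing ties a partition to a particular node.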
I have looked through the documentation and source code for both RDDs and the scheduler, but I haven't found anything that will do what I need. Does anybody know of a solution I could use?