The Spark workers are running side by side with scientific simulation code. The simulation writes its output to local SSDs to keep latency low. Given the volume of data involved (tens of terabytes or more), copying it to a global filesystem isn't really feasible. Executing a function on each node would allow us to read the data in situ without a copy.
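To make that concrete, the kind of thing I have in mind looks roughly like the sketch below. The hostnames and the SSD path are placeholders for our setup, and as far as I understand makeRDD's location preferences are only a scheduling hint, not a guarantee that each task lands on its node:

import org.apache.spark.{SparkConf, SparkContext}

// Rough sketch: one RDD element per node, each tagged with that node as its
// preferred location, then read the node-local directory from within the task.
// Host list and local path are placeholders; locality is a hint, not a guarantee.
object InSituReadSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("in-situ-read"))

    val hosts    = Seq("node01", "node02", "node03")   // placeholder node list
    val localDir = "/local/ssd/simulation-output"      // placeholder path

    // makeRDD with (value, preferredLocations) pairs: one partition per host.
    val perNode = sc.makeRDD(hosts.map(h => (h, Seq(h))))

    val records = perNode.flatMap { host =>
      // Hopefully runs on `host`; list whatever the simulation wrote locally.
      val files = Option(new java.io.File(localDir).listFiles)
        .getOrElse(Array.empty[java.io.File])
      files.map(f => (host, f.getName))
    }

    records.collect().foreach(println)
    sc.stop()
  }
}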
I understand that manually assigning tasks to nodes reduces fault tolerance, but the simulation codes already explicitly assign tasks, so a failure of any one node is already a full-job failure.

On Mon, Jul 18, 2016 at 3:43 PM Aniket Bhatnagar <aniket.bhatna...@gmail.com> wrote:

> You can't assume that the number of nodes will be constant, as some may
> fail, hence you can't guarantee that a function will execute at most once
> or at least once on a node. Can you explain your use case in a bit more
> detail?
>
> On Mon, Jul 18, 2016, 10:57 PM joshuata <joshaspl...@gmail.com> wrote:
>
>> I am working on a Spark application that requires the ability to run a
>> function on each node in the cluster. This is used to read data from a
>> directory that is not globally accessible to the cluster. I have tried
>> creating an RDD with n elements and n partitions so that it is evenly
>> distributed among the n nodes, and then mapping a function over the RDD.
>> However, the runtime makes no guarantees that each partition will be
>> stored on a separate node. This means that the code will run multiple
>> times on the same node while never running on another.
>>
>> I have looked through the documentation and source code for both RDDs
>> and the scheduler, but I haven't found anything that will do what I need.
>> Does anybody know of a solution I could use?
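For reference, the approach I described in my original message amounts to something like the sketch below (the node count is a placeholder, and the hostname reporting is only there to illustrate the problem); running it typically shows several partitions landing on the same host while other hosts get none:

import org.apache.spark.{SparkConf, SparkContext}

// Sketch of the attempt described above: n elements in n partitions, one task
// run over each partition. Nothing pins one task per node, so several
// partitions can end up on the same host. `numNodes` is a placeholder.
object OnePartitionPerNodeAttempt {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("per-node-attempt"))

    val numNodes = 16                                         // placeholder
    val perNode  = sc.parallelize(0 until numNodes, numNodes) // n elements, n partitions

    val hostsSeen = perNode.mapPartitions { _ =>
      // Report which host actually ran this partition; duplicates show the problem.
      Iterator(java.net.InetAddress.getLocalHost.getHostName)
    }

    hostsSeen.collect().groupBy(identity).foreach { case (host, runs) =>
      println(s"$host ran ${runs.length} partition(s)")
    }
    sc.stop()
  }
}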