Re: Execute function once on each node
" I am working on a spark application that requires the ability to run a function on each node in the cluster " -- Use Apache Ignite instead of Spark. Trust me it's awesome for this use case. Regards, Rabin Banerjee On Jul 19, 2016 3:27 AM, "joshuata"wrote: > I am working on a spark application that requires the ability to run a > function on each node in the cluster. This is used to read data from a > directory that is not globally accessible to the cluster. I have tried > creating an RDD with n elements and n partitions so that it is evenly > distributed among the n nodes, and then mapping a function over the RDD. > However, the runtime makes no guarantees that each partition will be stored > on a separate node. This means that the code will run multiple times on the > same node while never running on another. > > I have looked through the documentation and source code for both RDDs and > the scheduler, but I haven't found anything that will do what I need. Does > anybody know of a solution I could use? > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/Execute-function-once-on-each-node-tp27351.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > - > To unsubscribe e-mail: user-unsubscr...@spark.apache.org > >
Re: Execute function once on each node
Technical limitations keep us from running another filesystem on the SSDs. We are running on a very large HPC cluster without control over low-level system components. We have tried setting up an ad-hoc HDFS cluster on the nodes in our allocation, but we have had very little luck. It ends up being very brittle and difficult for the simulation code to access.

On Tue, Jul 19, 2016 at 7:08 AM Koert Kuipers wrote:
> The whole point of a well-designed global filesystem is to not move the
> data. [...]
Re: Execute function once on each node
Thank you for that advice. I have tried similar techniques, but not that one.

On Mon, Jul 18, 2016 at 11:42 PM Aniket Bhatnagar <aniket.bhatna...@gmail.com> wrote:
> Thanks for the explanation. Try creating a custom RDD whose getPartitions
> returns an array of custom partition objects of size n (= the number of
> nodes). [...]
Re: Execute function once on each node
The whole point of a well-designed global filesystem is to not move the data.

On Jul 19, 2016 10:07, "Koert Kuipers" wrote:
> If you run HDFS on those SSDs (with a low replication factor), wouldn't it
> also effectively write to local disk with low latency? [...]
Re: Execute function once on each node
Thanks for the explanation. Try creating a custom RDD whose getPartitions returns an array of custom partition objects of size n (= the number of nodes). In each custom partition object, you can store the file path and the IP/hostname where the partition needs to be computed. Then have getPreferredLocations return the IP/hostname from the partition object, and in the compute function, assert that you are on the right IP/hostname (or fail) and read the contents of the file.

Not 100% sure it will work, though.

On Tue, Jul 19, 2016, 2:54 AM Josh Asplund wrote:
> The spark workers are running side-by-side with scientific simulation
> code. The code writes output to local SSDs to keep latency low. [...]
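The custom-RDD idea described above could be sketched roughly as follows. This is untested; the class and field names are made up for illustration, and in real code the host list and local path would come from the simulation's layout. Note that getPreferredLocations is only a scheduling hint, which is why the compute function double-checks the host:

```scala
import java.net.InetAddress

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical partition type: carries the host that should compute it
// and the node-local path to read there.
case class NodeLocalPartition(index: Int, host: String, path: String)
  extends Partition

// One partition per node, each pinned (via preferred locations) to its host.
class NodeLocalRDD(sc: SparkContext, hosts: Seq[String], path: String)
  extends RDD[String](sc, Nil) {

  override protected def getPartitions: Array[Partition] =
    hosts.zipWithIndex.map { case (h, i) =>
      NodeLocalPartition(i, h, path): Partition
    }.toArray

  override protected def getPreferredLocations(split: Partition): Seq[String] =
    Seq(split.asInstanceOf[NodeLocalPartition].host)

  override def compute(split: Partition, context: TaskContext): Iterator[String] = {
    val p = split.asInstanceOf[NodeLocalPartition]
    val actual = InetAddress.getLocalHost.getHostName
    // Preferred locations are a hint, not a guarantee: fail loudly if the
    // scheduler placed this task on the wrong node.
    require(actual == p.host, s"expected to run on ${p.host}, got $actual")
    scala.io.Source.fromFile(p.path).getLines()
  }
}
```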
Re: Execute function once on each node
The spark workers are running side-by-side with scientific simulation code. The code writes output to local SSDs to keep latency low. Due to the volume of data being moved (tens of terabytes or more), it isn't really feasible to copy the data to a global filesystem. Executing a function on each node would allow us to read the data in situ, without a copy.

I understand that manually assigning tasks to nodes reduces fault tolerance, but the simulation codes already explicitly assign tasks, so a failure of any one node is already a full-job failure.

On Mon, Jul 18, 2016 at 3:43 PM Aniket Bhatnagar <aniket.bhatna...@gmail.com> wrote:
> You can't assume that the number of nodes will be constant as some may
> fail, hence you can't guarantee that a function will execute at most once
> or at least once on a node. [...]
Re: Execute function once on each node
You can't assume that the number of nodes will be constant, as some may fail, hence you can't guarantee that a function will execute at most once or at least once on a node. Can you explain your use case in a bit more detail?

On Mon, Jul 18, 2016, 10:57 PM joshuata wrote:
> I am working on a spark application that requires the ability to run a
> function on each node in the cluster. This is used to read data from a
> directory that is not globally accessible to the cluster. I have tried
> creating an RDD with n elements and n partitions so that it is evenly
> distributed among the n nodes, and then mapping a function over the RDD.
> However, the runtime makes no guarantees that each partition will be stored
> on a separate node. This means that the code will run multiple times on the
> same node while never running on another.
>
> I have looked through the documentation and source code for both RDDs and
> the scheduler, but I haven't found anything that will do what I need. Does
> anybody know of a solution I could use?
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Execute-function-once-on-each-node-tp27351.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
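The attempt the quoted message describes (n elements spread across n partitions) boils down to something like the following sketch. As the poster notes, nothing forces the n tasks onto n distinct nodes, so some nodes can receive several tasks while others receive none. `readLocalDirectory` is a hypothetical helper, and `sc` is the application's SparkContext:

```scala
// Hypothetical: n = number of worker nodes, known out of band.
val n = 16

// n elements in n partitions, one element per partition.
val rdd = sc.parallelize(0 until n, n)

// Hoped-for behavior: each partition runs on a different node.
// Actual behavior: the scheduler may co-locate several partitions on one
// node, so the local read below can repeat on some nodes and never happen
// on others.
rdd.foreachPartition { _ =>
  readLocalDirectory("/local/ssd/output")  // hypothetical helper
}
```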