Re: Task result in Spark Worker Node
On the worker side, it all happens in Executor. The task result is computed here:

https://github.com/apache/spark/blob/b45059d0d7809a986ba07a447deb71f11ec6afe4/core/src/main/scala/org/apache/spark/executor/Executor.scala#L210

then it's serialized along with some other goodies, and finally sent back to the driver here:

https://github.com/apache/spark/blob/b45059d0d7809a986ba07a447deb71f11ec6afe4/core/src/main/scala/org/apache/spark/executor/Executor.scala#L255

What happens on the driver is quite a bit more complicated and involves a number of spots in the code, but to get you started, the results are received here:

https://github.com/apache/spark/blob/b45059d0d7809a986ba07a447deb71f11ec6afe4/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L328

though perhaps a more interesting spot is where they are handled, in DAGScheduler#handleTaskCompletion:

https://github.com/apache/spark/blob/b45059d0d7809a986ba07a447deb71f11ec6afe4/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1001

Also, I think I know what you mean, but just to make sure: I wouldn't say the results from the worker are "broadcast" back to the driver. (a) In Spark, "broadcast" usually refers to a particular API for sharing immutable data from the driver to the workers (one direction only), and (b) it doesn't fit the more general meaning of broadcast either, since the results are sent only to the driver, not to all nodes.

On Sun, Mar 29, 2015 at 8:34 PM, raggy <raghav0110...@gmail.com> wrote:

I am a PhD student working on a research project related to Apache Spark. I am trying to modify some of the Spark source code so that, instead of sending the final result RDD from the worker nodes to a master node, I send the final result RDDs to some other node. In order to do this, I have been trying to identify the point at which the Spark worker nodes broadcast the results of a job back to the master.
So far, I understand that in Spark the master serializes the RDD and the functions to be applied to it, and sends them over to the worker nodes. In the context of reduce, it serializes the RDD partition and the reduce function and sends them to the worker nodes.

However, my understanding of how things happen at the worker node is very limited, and I would appreciate it if someone could help me identify where the process of broadcasting the results of local worker computations back to the master node takes place.

This is some of the limited knowledge that I have about the worker nodes: each job gets divided into smaller sets of tasks called stages. Each stage is either a shuffle map stage or a result stage. In a shuffle map stage, the task results are used as input for another stage. The result stage uses the RDD to compute the action that initiated the job, so the result stage executes the last task for the job on the worker node. I would assume that after this is done, it gets the result and broadcasts it to the driver application (the master).

In ResultTask.scala (spark-core, src/main/scala, org.apache.spark.scheduler) it states: "A task that sends back the output to the driver application." However, I don't see when or where this happens in the source code. I would very much appreciate it if someone could help me identify where this happens in the Spark source code.
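To make the worker-side handoff above concrete, here is a minimal sketch in plain Scala (no Spark dependency; all names here are invented for illustration, not Spark's actual classes). It models the decision the executor makes at the linked Executor.scala spot: a result small enough for the RPC frame is sent inline with the status update, while a larger one is stashed locally (in Spark, via the block manager) and only an id and size are sent, for the driver to fetch separately.

```scala
// Toy model of the executor's "direct vs. indirect" result path.
object ResultPathSketch {
  sealed trait TaskResultMessage
  // Small results travel inline in the status-update message to the driver.
  final case class DirectResult(bytes: Array[Byte]) extends TaskResultMessage
  // Large results stay on the worker; only an id and size are sent,
  // and the driver fetches the bytes itself afterwards.
  final case class IndirectResult(blockId: String, size: Int) extends TaskResultMessage

  // Mirrors the size check against the RPC frame limit.
  def packageResult(taskId: Long, serialized: Array[Byte], maxDirectBytes: Int): TaskResultMessage =
    if (serialized.length <= maxDirectBytes) DirectResult(serialized)
    else IndirectResult("taskresult_" + taskId, serialized.length)

  def main(args: Array[String]): Unit = {
    val small = packageResult(1L, Array.fill(10)(0.toByte), maxDirectBytes = 128)
    val big   = packageResult(2L, Array.fill(1024)(0.toByte), maxDirectBytes = 128)
    assert(small.isInstanceOf[DirectResult])
    assert(big.isInstanceOf[IndirectResult])
    println("small result -> " + small.getClass.getSimpleName)
    println("big result   -> " + big.getClass.getSimpleName)
  }
}
```

Either way, the message ultimately goes out through the executor backend's status update, which is why intercepting that call site is the natural place for the kind of redirection the original question asks about.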
Task result in Spark Worker Node
I am a PhD student working on a research project related to Apache Spark. I am trying to modify some of the Spark source code so that, instead of sending the final result RDD from the worker nodes to a master node, I send the final result RDDs to some other node. In order to do this, I have been trying to identify the point at which the Spark worker nodes broadcast the results of a job back to the master.

So far, I understand that in Spark the master serializes the RDD and the functions to be applied to it, and sends them over to the worker nodes. In the context of reduce, it serializes the RDD partition and the reduce function and sends them to the worker nodes.

However, my understanding of how things happen at the worker node is very limited, and I would appreciate it if someone could help me identify where the process of broadcasting the results of local worker computations back to the master node takes place.

This is some of the limited knowledge that I have about the worker nodes: each job gets divided into smaller sets of tasks called stages. Each stage is either a shuffle map stage or a result stage. In a shuffle map stage, the task results are used as input for another stage. The result stage uses the RDD to compute the action that initiated the job, so the result stage executes the last task for the job on the worker node. I would assume that after this is done, it gets the result and broadcasts it to the driver application (the master).

In ResultTask.scala (spark-core, src/main/scala, org.apache.spark.scheduler) it states: "A task that sends back the output to the driver application." However, I don't see when or where this happens in the source code. I would very much appreciate it if someone could help me identify where this happens in the Spark source code.
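The stage split described above can be sketched with a toy model in plain Scala (all names here are invented for illustration; Spark's real classes live in org.apache.spark.scheduler): a job with n shuffle boundaries becomes n shuffle-map stages whose output stays on the workers, followed by a single result stage whose tasks send their output to the driver.

```scala
// Toy model of how a job splits into stages at shuffle boundaries.
object StageSketch {
  sealed trait ToyStage
  // Output is shuffle data consumed by a later stage; it stays on the workers.
  final case class ToyShuffleMapStage(id: Int) extends ToyStage
  // Final stage of a job: its tasks send their output back to the driver.
  final case class ToyResultStage(id: Int) extends ToyStage

  // n shuffle boundaries -> n shuffle-map stages, then one result stage.
  def planJob(numShuffles: Int): Seq[ToyStage] =
    (0 until numShuffles).map(ToyShuffleMapStage(_)) :+ ToyResultStage(numShuffles)

  def main(args: Array[String]): Unit = {
    // e.g. rdd.map(...).reduceByKey(...).collect() has one shuffle boundary
    println(planJob(1))
  }
}
```

So "the last task for the job" in the description above is always a task of the single result stage, which is why ResultTask is the place whose output path matters for this question.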
--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Task-result-in-Spark-Worker-Node-tp22283.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org