On the worker side, it all happens in Executor. The task result is computed here:

https://github.com/apache/spark/blob/b45059d0d7809a986ba07a447deb71f11ec6afe4/core/src/main/scala/org/apache/spark/executor/Executor.scala#L210

then it's serialized along with some other goodies, and finally sent back to the driver here:

https://github.com/apache/spark/blob/b45059d0d7809a986ba07a447deb71f11ec6afe4/core/src/main/scala/org/apache/spark/executor/Executor.scala#L255

What happens on the driver is quite a bit more complicated and involves a number of spots in the code, but to get you started, the results are received here:

https://github.com/apache/spark/blob/b45059d0d7809a986ba07a447deb71f11ec6afe4/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L328

though perhaps a more interesting spot is where they are handled in DAGScheduler#handleTaskCompletion:

https://github.com/apache/spark/blob/b45059d0d7809a986ba07a447deb71f11ec6afe4/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1001

Also, I think I know what you mean, but just to make sure: I wouldn't say the results from the worker are "broadcast" back to the driver. (a) In Spark, "broadcast" tends to refer to a particular API for sharing immutable data from the driver to the workers (one direction only), and (b) it doesn't really fit the more general meaning of "broadcast" either, since the results are sent only to the driver, not to all nodes.

On Sun, Mar 29, 2015 at 8:34 PM, raggy <raghav0110...@gmail.com> wrote:

> I am a PhD student working on a research project related to Apache Spark.
> I am trying to modify some of the Spark source code so that instead of
> sending the final result RDD from the worker nodes to a master node, I
> want to send the final result RDDs to some different node. In order to do
> this, I have been trying to identify at which point the Spark worker
> nodes broadcast the results of a job back to the master.
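To make the worker-side flow concrete, here is a simplified, self-contained sketch of what happens around those two lines in Executor: the result value is serialized, and depending on its size it is either shipped inline with the status update or stored out-of-band with only a reference sent back. The names (`DirectResult`, `IndirectResult`, `packageResult`, `maxDirectResultSize`) are illustrative stand-ins, not Spark's actual classes or config keys:

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Illustrative sketch only -- Spark's real code lives in Executor.run()
// and uses DirectTaskResult/IndirectTaskResult plus the block manager.
object ResultFlowSketch {
  sealed trait SerializedResult
  // Small results travel inline with the status update to the driver.
  case class DirectResult(bytes: Array[Byte]) extends SerializedResult
  // Large results are stored (in Spark: the block manager) and only a
  // reference is sent; the driver fetches the blob separately.
  case class IndirectResult(blockId: String, size: Int) extends SerializedResult

  // Plain Java serialization, standing in for Spark's serializer.
  def serialize(value: Any): Array[Byte] = {
    val bos = new ByteArrayOutputStream()
    val oos = new ObjectOutputStream(bos)
    oos.writeObject(value)
    oos.close()
    bos.toByteArray
  }

  // `maxDirectResultSize` plays the role of Spark's size threshold for
  // deciding between the direct and indirect paths.
  def packageResult(value: Any, maxDirectResultSize: Int): SerializedResult = {
    val bytes = serialize(value)
    if (bytes.length <= maxDirectResultSize) DirectResult(bytes)
    else IndirectResult(blockId = "taskresult_0", size = bytes.length)
  }

  def main(args: Array[String]): Unit = {
    packageResult(List(1, 2, 3), maxDirectResultSize = 1 << 20) match {
      case DirectResult(b)       => println(s"direct, ${b.length} bytes")
      case IndirectResult(id, s) => println(s"indirect via $id, $s bytes")
    }
  }
}
```

The point of the two-path design is that the status-update message back to the driver should stay small; anything over the threshold goes through a separate fetch, which is why you see both a serialization site and a send site in Executor.scala.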
>
> So far, I understand that in Spark the master serializes the RDD and the
> functions to be applied to them and sends them over to the worker nodes.
> In the context of reduce, it serializes the RDD partition and the reduce
> function and sends them to the worker nodes. However, my understanding of
> how things happen at the worker node is very limited, and I would
> appreciate it if someone could help me identify where the process of
> broadcasting the results of local worker computations back to the master
> node takes place.
>
> This is some of the limited knowledge that I have about the worker nodes:
>
> Each job gets divided into smaller sets of tasks called stages. Each
> stage is either a Shuffle Map Stage or a Result Stage. In a Shuffle Map
> Stage, the task results are used as input for another stage. The Result
> Stage uses the RDD to compute the action that initiated the job, so this
> result stage executes the last task for the job on the worker node. I
> would assume that after this is done, it gets the result and broadcasts
> it to the driver application (the master).
>
> In ResultTask.scala (spark-core src/main/scala org.apache.spark.scheduler)
> it states "A task that sends back the output to the driver application."
> However, I don't see when or where this happens in the source code. I
> would very much appreciate it if someone could help me identify where
> this happens in the Spark source code.
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Task-result-in-Spark-Worker-Node-tp22283.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.