On the worker side, it all happens in Executor.  The task result is
computed here:

https://github.com/apache/spark/blob/b45059d0d7809a986ba07a447deb71f11ec6afe4/core/src/main/scala/org/apache/spark/executor/Executor.scala#L210
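
In case it helps to see the shape of that step without digging through the
real TaskRunner, here's a tiny self-contained toy (not Spark's actual
classes -- the names are made up for illustration) that models what that line
amounts to: the executor has a deserialized Task and simply runs it to
produce a value.

object ToyTaskRun {
  // stand-in for Spark's Task[T] (the real one lives in org.apache.spark.scheduler)
  trait Task[T] { def runTask(): T }

  def run[T](taskId: Long, task: Task[T]): T = {
    // analogous to `val value = task.run(taskId)` inside Executor's TaskRunner
    val value = task.runTask()
    println(s"task $taskId produced $value")
    value
  }

  def main(args: Array[String]): Unit = {
    val sumTask = new Task[Int] { def runTask(): Int = (1 to 100).sum }
    run(0L, sumTask) // prints 5050
  }
}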

then it's serialized along with some other goodies, and finally sent back to
the driver here:

https://github.com/apache/spark/blob/b45059d0d7809a986ba07a447deb71f11ec6afe4/core/src/main/scala/org/apache/spark/executor/Executor.scala#L255
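
Roughly speaking (I'm going from memory of the 1.x code path here, so treat
the details as approximate): small results are sent back inline in the status
update, while large ones are stashed in the block manager and only a
reference is sent to the driver, which fetches the bytes later. A toy sketch
of that decision, again with made-up names:

import java.nio.ByteBuffer

object ToyResultShipping {
  sealed trait TaskResultMsg
  // the result bytes travel inline with the status update
  case class DirectResult(bytes: ByteBuffer) extends TaskResultMsg
  // only a reference travels; the driver fetches the bytes separately
  case class IndirectResult(blockId: String, size: Int) extends TaskResultMsg

  def packageResult(taskId: Long, serialized: ByteBuffer, maxInlineBytes: Int): TaskResultMsg =
    if (serialized.remaining() <= maxInlineBytes) {
      DirectResult(serialized)
    } else {
      // in real Spark the bytes would be handed to the block manager here
      IndirectResult(s"taskresult_$taskId", serialized.remaining())
    }

  def main(args: Array[String]): Unit = {
    println(packageResult(1L, ByteBuffer.allocate(128), maxInlineBytes = 1024))
    println(packageResult(2L, ByteBuffer.allocate(1 << 20), maxInlineBytes = 1024))
  }
}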

What happens on the driver is quite a bit more complicated, and involves a
number of spots in the code, but at least to get you started, the results
are received here:

https://github.com/apache/spark/blob/b45059d0d7809a986ba07a447deb71f11ec6afe4/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L328

though perhaps a more interesting spot is where they are handled in
DAGScheduler#handleTaskCompletion:
https://github.com/apache/spark/blob/b45059d0d7809a986ba07a447deb71f11ec6afe4/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1001
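
As a very rough, self-contained sketch of the driver side (made-up names, not
the real classes): the scheduler receives the serialized bytes, a
result-getter thread deserializes them off the scheduler's critical path, and
the value eventually reaches the DAGScheduler's completion handling.

import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import java.nio.ByteBuffer

object ToyDriverSide {
  def onStatusUpdate(taskId: Long, serializedResult: ByteBuffer): Unit = {
    // analogous to the scheduler handing the bytes to a result-getter thread
    val bytes = new Array[Byte](serializedResult.remaining())
    serializedResult.get(bytes)
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes))
    val value = in.readObject()
    in.close()
    handleTaskCompletion(taskId, value) // analogous to DAGScheduler#handleTaskCompletion
  }

  private def handleTaskCompletion(taskId: Long, value: Any): Unit =
    println(s"task $taskId completed with result $value")

  def main(args: Array[String]): Unit = {
    // stand-in for what an executor would send back
    val bos = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bos)
    out.writeObject(Integer.valueOf(5050))
    out.close()
    onStatusUpdate(0L, ByteBuffer.wrap(bos.toByteArray))
  }
}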


Also, I think I know what you mean, but just to make sure: I wouldn't say
the results from the worker are "broadcast" back to the driver.  (a) In
Spark, "broadcast" tends to refer to a particular API for sharing immutable
data from the driver to the workers (only one direction), and (b) it doesn't
really fit the more general meaning of "broadcast" either, since the results
are sent only to the driver, not to all nodes.
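
For reference, this is what "broadcast" usually means in Spark -- the real
sc.broadcast API, which ships a read-only value from the driver out to the
executors. Task results flow the other way, e.g. via collect():

import org.apache.spark.{SparkConf, SparkContext}

object BroadcastExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("broadcast-example").setMaster("local[*]"))
    // driver -> executors, one direction, read-only
    val lookup = sc.broadcast(Map(1 -> "a", 2 -> "b"))
    // tasks read the broadcast value on the executors
    val decoded = sc.parallelize(Seq(1, 2, 1)).map(i => lookup.value(i))
    // collect() is the opposite flow: task results come back to the driver only
    println(decoded.collect().mkString(","))
    sc.stop()
  }
}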

On Sun, Mar 29, 2015 at 8:34 PM, raggy <raghav0110...@gmail.com> wrote:

> I am a PhD student working on a research project related to Apache Spark.
> I am trying to modify some of the Spark source code such that instead of
> sending the final result RDD from the worker nodes to a master node, I want
> to send the final result RDDs to some different node. In order to do this,
> I have been trying to identify at which point the Spark worker nodes
> broadcast the results of a job back to the master.
>
> So far, I understand that in Spark, the master serializes the RDD and the
> functions to be applied on them and sends them over to the worker nodes. In
> the context of reduce, it serializes the RDD partition and the reduce
> function and sends them to the worker nodes. However, my understanding of
> how things happen at the worker node is very limited and I would appreciate
> it if someone could help me identify where the process of broadcasting the
> results of local worker computations back to the master node takes place.
>
> This is some of the limited knowledge that I have about the worker nodes:
>
> Each job gets divided into smaller sets of tasks called stages. Each stage
> is either a Shuffle Map Stage or a Result Stage. In a Shuffle Map Stage, the
> task results are used as input for another stage. The result stage uses the
> RDD to compute the action that initiated the job. So, this result stage
> executes the last task for the job on the worker node. I would assume after
> this is done, it gets the result and broadcasts it to the driver
> application (the master).
>
> In ResultTask.scala (spark-core src/main/scala org.apache.spark.scheduler)
> it states "A task that sends back the output to the driver application."
> However, I don't see when or where this happens in the source code. I would
> very much appreciate it if someone could help me identify where this
> happens in the Spark source code.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Task-result-in-Spark-Worker-Node-tp22283.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
