Re: Task result in Spark Worker Node

2015-04-13 Thread Imran Rashid
On the worker side, it all happens in Executor.  The task result is
computed here:

https://github.com/apache/spark/blob/b45059d0d7809a986ba07a447deb71f11ec6afe4/core/src/main/scala/org/apache/spark/executor/Executor.scala#L210
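
To make that concrete, here is a heavily simplified, self-contained model of
the executor-side step at that line. The trait and method names below are
stubs for illustration, not Spark's actual API; the real TaskRunner also
handles deserialization, metrics, accumulators, and failures:

  import java.nio.ByteBuffer

  // Stub standing in for org.apache.spark.scheduler.Task (illustrative only)
  trait Task[T] { def runTask(attemptId: Long): T }

  // Run the task and serialize its return value, as the TaskRunner does.
  def computeAndSerialize[T](task: Task[T], attemptId: Long): ByteBuffer = {
    val value = task.runTask(attemptId)  // the linked line: the task produces its result
    val bos = new java.io.ByteArrayOutputStream()
    val oos = new java.io.ObjectOutputStream(bos)
    oos.writeObject(value.asInstanceOf[AnyRef])  // real code uses Spark's pluggable serializer
    oos.close()
    ByteBuffer.wrap(bos.toByteArray)
  }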

then it's serialized along with some other goodies, and finally sent back to
the driver here:

https://github.com/apache/spark/blob/b45059d0d7809a986ba07a447deb71f11ec6afe4/core/src/main/scala/org/apache/spark/executor/Executor.scala#L255
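
The "other goodies" are accumulator updates and task metrics, wrapped in a
DirectTaskResult; if the serialized payload is too large for the RPC layer,
the executor instead stores it in the block manager and sends back a small
IndirectTaskResult pointer. A rough sketch of that decision, with simplified
signatures (the real classes live in org.apache.spark.scheduler):

  import java.nio.ByteBuffer

  // Simplified stand-ins for Spark's TaskResult hierarchy (illustrative only)
  sealed trait TaskResult
  case class DirectTaskResult(valueBytes: ByteBuffer) extends TaskResult
  case class IndirectTaskResult(blockId: String, size: Int) extends TaskResult

  def packageResult(bytes: ByteBuffer, maxDirectSize: Int, taskId: Long): TaskResult =
    if (bytes.limit() > maxDirectSize)
      IndirectTaskResult(s"taskresult_$taskId", bytes.limit())  // driver fetches it from the block manager
    else
      DirectTaskResult(bytes)  // travels inline with the FINISHED status update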

What happens on the driver is quite a bit more complicated, and involves a
number of spots in the code, but at least to get you started, the results
are received here:

https://github.com/apache/spark/blob/b45059d0d7809a986ba07a447deb71f11ec6afe4/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L328
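
Roughly, statusUpdate checks the task's state and, for FINISHED tasks, hands
the serialized bytes to a TaskResultGetter, which deserializes them on its
own thread pool before notifying the schedulers. A simplified, self-contained
model (not the real signatures):

  import java.nio.ByteBuffer
  import java.util.concurrent.Executors

  object TaskState extends Enumeration { val RUNNING, FINISHED, FAILED = Value }

  // Stand-in for TaskResultGetter: deserialize off the RPC thread, then notify.
  class ResultGetter(onSuccess: (Long, AnyRef) => Unit) {
    private val pool = Executors.newFixedThreadPool(4)
    def enqueueSuccessfulTask(taskId: Long, bytes: ByteBuffer): Unit =
      pool.execute(new Runnable {
        def run(): Unit = {
          // The real code also handles IndirectTaskResult by first fetching
          // the serialized block from the remote block manager.
          val in = new java.io.ObjectInputStream(
            new java.io.ByteArrayInputStream(bytes.array()))
          onSuccess(taskId, in.readObject())
        }
      })
  }

  def statusUpdate(taskId: Long, state: TaskState.Value, bytes: ByteBuffer,
                   getter: ResultGetter): Unit =
    if (state == TaskState.FINISHED) getter.enqueueSuccessfulTask(taskId, bytes)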

though perhaps a more interesting spot is where they are handled in
DAGScheduler#handleTaskCompletion:
https://github.com/apache/spark/blob/b45059d0d7809a986ba07a447deb71f11ec6afe4/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1001
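
For a ResultTask, handleTaskCompletion routes the deserialized value to the
job that requested it, through a listener the action registered (collect(),
for example, supplies a handler that stores each partition's array into the
final result). A simplified model of that branch (names shortened, not the
exact code):

  trait JobListener { def taskSucceeded(partition: Int, result: Any): Unit }

  class ActiveJob(numPartitions: Int, val listener: JobListener) {
    val finished = Array.fill(numPartitions)(false)
    var numFinished = 0
  }

  def handleResultTaskCompletion(job: ActiveJob, partition: Int, result: Any): Unit =
    if (!job.finished(partition)) {
      job.finished(partition) = true
      job.numFinished += 1
      // Hands the value to driver-side user code; the job completes once all
      // partitions have finished. (A ShuffleMapTask takes a different branch:
      // its map-output location is registered rather than a value delivered.)
      job.listener.taskSucceeded(partition, result)
    }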


Also, I think I know what you mean, but just to make sure: I wouldn't say
the results from the worker are "broadcast" back to the driver. (a) In
Spark, "broadcast" refers to a particular API for sharing immutable data
from the driver to the workers (one direction only), and (b) it doesn't
really fit the more general meaning of broadcast either, since the results
are sent only to the driver, not to all nodes.
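
For contrast, a minimal runnable example showing both directions in local
mode: the broadcast variable flows driver -> executors, while each task's
result flows back to the driver only (gathered here by collect()):

  import org.apache.spark.{SparkConf, SparkContext}

  object Directions {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(
        new SparkConf().setAppName("directions").setMaster("local[2]"))
      // broadcast: immutable data shared driver -> all executors
      val lookup = sc.broadcast(Map(1 -> "a", 2 -> "b"))
      // task results: each task's output is sent back to the driver only
      val out = sc.parallelize(Seq(1, 2)).map(lookup.value).collect()
      println(out.mkString(", "))  // a, b
      sc.stop()
    }
  }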

On Sun, Mar 29, 2015 at 8:34 PM, raggy raghav0110...@gmail.com wrote:

 I am a PhD student working on a research project related to Apache Spark. I
 am trying to modify some of the Spark source code such that instead of
 sending the final result RDD from the worker nodes to a master node, I want
 to send the final result RDDs to some different node. In order to do this, I
 have been trying to identify at which point the Spark worker nodes broadcast
 the results of a job back to the master.

 So far, I understand that in Spark, the master serializes the RDD and the
 functions to be applied on them and sends them over to the worker nodes. In
 the context of reduce, it serializes the RDD partition and the reduce
 function and sends them to the worker nodes. However, my understanding of
 how things happen at the worker node is very limited and I would appreciate
 it if someone could help me identify where the process of broadcasting the
 results of local worker computations back to the master node takes place.

 This is some of the limited knowledge that I have about the worker nodes:

 Each job gets divided into smaller sets of tasks called stages. Each stage
 is either a Shuffle Map Stage or a Result Stage. In a Shuffle Map Stage, the
 task results are used as input for another stage. The result stage uses the
 RDD to compute the action that initiated the job. So, this result stage
 executes the last task for the job on the worker node. I would assume that
 after this is done, it gets the result and broadcasts it to the driver
 application (the master).

 In ResultTask.scala (spark-core, src/main/scala, org.apache.spark.scheduler)
 it states "A task that sends back the output to the driver application."
 However, I don't see when or where this happens in the source code. I would
 very much appreciate it if someone could help me identify where this happens
 in the Spark source code.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Task-result-in-Spark-Worker-Node-tp22283.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Task result in Spark Worker Node

2015-03-29 Thread raggy
I am a PhD student working on a research project related to Apache Spark. I
am trying to modify some of the Spark source code such that instead of
sending the final result RDD from the worker nodes to a master node, I want
to send the final result RDDs to some different node. In order to do this, I
have been trying to identify at which point the Spark worker nodes broadcast
the results of a job back to the master.

So far, I understand that in Spark, the master serializes the RDD and the
functions to be applied on them and sends them over to the worker nodes. In
the context of reduce, it serializes the RDD partition and the reduce
function and sends them to the worker nodes. However, my understanding of
how things happen at the worker node is very limited and I would appreciate
it if someone could help me identify where the process of broadcasting the
results of local worker computations back to the master node takes place. 
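
For concreteness, a minimal reduce job of the kind described above (local
mode, illustrative only): the driver ships the reduce closure with each task,
each task reduces its own partition locally, and only those per-partition
values travel back to the driver, which merges them.

  import org.apache.spark.{SparkConf, SparkContext}

  object ReduceJob {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(
        new SparkConf().setAppName("reduce").setMaster("local[2]"))
      // 4 partitions -> 4 tasks; each sends one partial sum to the driver
      val total = sc.parallelize(1 to 100, numSlices = 4).reduce(_ + _)
      println(total)  // 5050
      sc.stop()
    }
  }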

This is some of the limited knowledge that I have about the worker nodes: 

Each job gets divided into smaller sets of tasks called stages. Each stage
is either a Shuffle Map Stage or a Result Stage. In a Shuffle Map Stage, the
task results are used as input for another stage. The result stage uses the
RDD to compute the action that initiated the job. So, this result stage
executes the last task for the job on the worker node. I would assume that
after this is done, it gets the result and broadcasts it to the driver
application (the master).
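
For example, the following job produces both stage types: reduceByKey
introduces a shuffle, so Spark plans a shuffle map stage that feeds it and a
final result stage whose tasks run collect() and send their output to the
driver (illustrative, local mode):

  import org.apache.spark.{SparkConf, SparkContext}

  object TwoStages {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(
        new SparkConf().setAppName("stages").setMaster("local[2]"))
      val counts = sc.parallelize(Seq("a", "b", "a"))
        .map(w => (w, 1))    // runs in the shuffle map stage
        .reduceByKey(_ + _)  // shuffle boundary between the two stages
        .collect()           // result stage: output goes back to the driver
      println(counts.mkString(", "))  // e.g. (a,2), (b,1)
      sc.stop()
    }
  }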

In ResultTask.scala (spark-core, src/main/scala, org.apache.spark.scheduler)
it states "A task that sends back the output to the driver application."
However, I don't see when or where this happens in the source code. I would
very much appreciate it if someone could help me identify where this happens
in the Spark source code.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Task-result-in-Spark-Worker-Node-tp22283.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org