I am a PhD student trying to understand the internals of Spark so that I can make some modifications to it. I am trying to understand how the aggregation of the distributed datasets (over the network) onto the driver node works. I would very much appreciate it if someone could point me towards the source code involved in the aggregation over the network. An explanation of how it works would also be appreciated.
So far, I have followed the code and identified that the handleJobSubmitted() function in DAGScheduler.scala is invoked when a job is scheduled. Since I am trying to run it on a cluster, I then reach listenerBus.post(SparkListenerJobStart(job.jobId, jobSubmissionTime, stageInfos, properties)) on line 759 of DAGScheduler.scala. I am not sure where to go from here.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Aggregation-of-distributed-datasets-tp22048.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
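For what it's worth, the pattern I am trying to trace can be sketched in plain Scala (this is a hypothetical illustration, not actual Spark source): an action like RDD.reduce runs the combining function once inside each partition on the executors, and the partial results are then sent back and merged again on the driver. The names below (partitions, reduceLike) are my own and do not exist in Spark.

```scala
// Hypothetical sketch of driver-side aggregation, NOT Spark source code.
object DriverMergeSketch {
  // Pretend each inner Seq[Int] is one partition held by a different executor.
  val partitions: Seq[Seq[Int]] = Seq(Seq(1, 2, 3), Seq(4, 5), Seq(6))

  // Mimics the two-level combine performed by an action such as RDD.reduce.
  def reduceLike(f: (Int, Int) => Int): Int = {
    // Step 1: each "executor" reduces its own partition locally...
    val partials: Seq[Int] = partitions.map(_.reduce(f))
    // Step 2: ...the partial results travel back, and the driver merges them.
    partials.reduce(f)
  }

  def main(args: Array[String]): Unit = {
    println(reduceLike(_ + _)) // prints 21, i.e. 1+2+3+4+5+6
  }
}
```

If this matches what happens in Spark, the network transfer I am interested in would be the step where the partial results come back to the driver, so pointers to that code path would be ideal.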