Re: Understanding pyspark data flow on worker nodes

2016-07-08 Thread Adam Roberts
You can look into its source code: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/

Re: Understanding pyspark data flow on worker nodes

2016-07-08 Thread Reynold Xin
You can look into its source code:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala

On Thu, Jul 7, 2016 at 11:01 PM, Amit Rana wrote:
> Hi all,
>
> Did anyone get a chance to look into it??
> Any sort of

Re: Understanding pyspark data flow on worker nodes

2016-07-07 Thread Amit Rana
As mentioned in the documentation: PythonRDD objects launch Python subprocesses and communicate with them using pipes, sending the user's code and the data to be processed. I am trying to understand how this data transfer over pipes is implemented. Can anyone please guide
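For intuition, here is a minimal sketch of the pattern described above, in plain Python rather than Spark's actual code: a parent process launches a Python subprocess and exchanges data with it over stdin/stdout pipes. The child command and payload are made up for illustration; the real PythonRDD protocol additionally frames each message and ships the pickled user function over the same pipes.

import subprocess
import sys

# Launch a Python subprocess whose stdin/stdout are pipes back to us,
# the same plumbing PythonRDD sets up between a JVM executor and its
# Python worker (minus Spark's framing and serialization).
worker = subprocess.Popen(
    [sys.executable, "-c",
     "import sys\n"
     "for line in sys.stdin:\n"
     "    sys.stdout.write(line.upper())\n"],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    text=True,
)

# Send "data to be processed" down the pipe and read the result back.
out, _ = worker.communicate("hello from the parent\n")
print(out)  # HELLO FROM THE PARENT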

Re: Understanding pyspark data flow on worker nodes

2016-07-07 Thread Sun Rui
You can read https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals. For PySpark data flow on worker nodes, you can read the source code of PythonRDD.scala. Python worker processes communicate with Spark executors
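To make the executor-to-worker exchange a bit more concrete, below is a hedged sketch of length-prefixed framing over a stream. write_with_length and read_with_length are illustrative helpers, not Spark's actual functions; the real protocol in PythonRDD.scala and worker.py is considerably more involved (it also carries the pickled user function, broadcast variables, and accumulator updates).

import io
import struct

def write_with_length(payload: bytes, stream) -> None:
    # Big-endian 4-byte length header, then the payload bytes.
    stream.write(struct.pack(">i", len(payload)))
    stream.write(payload)

def read_with_length(stream) -> bytes:
    (length,) = struct.unpack(">i", stream.read(4))
    return stream.read(length)

# Demo against an in-memory buffer standing in for the pipe.
buf = io.BytesIO()
write_with_length(b"a serialized record", buf)
buf.seek(0)
print(read_with_length(buf))  # b'a serialized record'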

Understanding pyspark data flow on worker nodes

2016-07-07 Thread Amit Rana
Hi all, I am trying to trace the data flow in PySpark. I am using IntelliJ IDEA on Windows 7. I had submitted a Python job as follows: --master local[4] I have made the following observations after running the above command in debug mode: -> Locally, when a PySpark interpreter starts, it also
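For anyone reproducing this, a hypothetical trace_flow.py like the sketch below could be submitted with spark-submit --master local[4] trace_flow.py and stepped through in a debugger; the script name and contents are illustrative, not taken from the thread.

from pyspark import SparkContext

sc = SparkContext(appName="TraceDataFlow")

# The lambda below is pickled on the driver, shipped to the executors,
# and run inside the Python worker subprocesses -- a useful spot for a
# breakpoint when tracing the data flow.
result = sc.parallelize(range(8), 4).map(lambda x: x * x).collect()
print(result)  # [0, 1, 4, 9, 16, 25, 36, 49]

sc.stop()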