You can look at the PythonRDD source code:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala
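
To make the socket-based exchange a bit more concrete, here is a rough sketch in plain Python. It is not Spark's actual code, and the real protocol in PythonRDD.scala is more involved. It only illustrates the general idea of one side sending an operation plus length-prefixed, serialized records over a socket, and the other side sending results back the same way. All names here (write_frame, read_frame, OPS, END_OF_STREAM, run_job) are made up for the illustration. The "worker" is a thread rather than a separate Python process, and the operation is looked up by name instead of being shipped as pickled code, just to keep the example self-contained.

```python
import pickle
import socket
import struct
import threading

# Hypothetical sentinel for this sketch: a negative length marks end of data.
END_OF_STREAM = -1

# Hypothetical registry standing in for "the user's code". The sketch sends
# only a name over the socket and looks the function up on the worker side.
OPS = {"square": lambda x: x * x}


def write_frame(stream, payload):
    """Write one frame: a 4-byte big-endian length, then the payload bytes."""
    stream.write(struct.pack("!i", len(payload)))
    stream.write(payload)


def read_frame(stream):
    """Read one frame; return None when the end-of-stream marker arrives."""
    header = stream.read(4)
    if len(header) < 4:
        raise EOFError("stream closed before a full header arrived")
    (length,) = struct.unpack("!i", header)
    if length == END_OF_STREAM:
        return None
    return stream.read(length)


def worker(sock):
    """Plays the Python worker: read the op name and records, send results."""
    with sock, sock.makefile("rwb") as stream:
        op = OPS[read_frame(stream).decode("utf-8")]
        while True:
            frame = read_frame(stream)
            if frame is None:
                break
            write_frame(stream, pickle.dumps(op(pickle.loads(frame))))
        stream.write(struct.pack("!i", END_OF_STREAM))
        stream.flush()


def run_job(op_name, items):
    """Plays the executor side: send the op and data, then collect results."""
    executor_side, worker_side = socket.socketpair()
    thread = threading.Thread(target=worker, args=(worker_side,))
    thread.start()
    results = []
    with executor_side, executor_side.makefile("rwb") as stream:
        write_frame(stream, op_name.encode("utf-8"))
        for item in items:
            write_frame(stream, pickle.dumps(item))
        stream.write(struct.pack("!i", END_OF_STREAM))
        stream.flush()
        while True:
            frame = read_frame(stream)
            if frame is None:
                break
            results.append(pickle.loads(frame))
    thread.join()
    return results


if __name__ == "__main__":
    print(run_job("square", [1, 2, 3]))
```

The sketch sends all input before reading any output, which is fine for a handful of records but would stall on large inputs once the socket buffers fill up. A real implementation has to read and write concurrently.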

On Thu, Jul 7, 2016 at 11:01 PM, Amit Rana <amitranavs...@gmail.com> wrote:

> Hi all,
>
> Did anyone get a chance to look into it?
> Any sort of guidance will be much appreciated.
>
> Thanks,
> Amit Rana
> On 7 Jul 2016 14:28, "Amit Rana" <amitranavs...@gmail.com> wrote:
>
>> As mentioned in the documentation:
>> PythonRDD objects launch Python subprocesses and communicate with them
>> using pipes, sending the user's code and the data to be processed.
>>
>> I am trying to understand the implementation of how this data transfer
>> is happening using pipes.
>> Can anyone please guide me along that line?
>>
>> Thanks,
>> Amit Rana
>> On 7 Jul 2016 13:44, "Sun Rui" <sunrise_...@163.com> wrote:
>>
>>> You can read
>>> https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals
>>> For PySpark data flow on worker nodes, you can read the source code of
>>> PythonRDD.scala. Python worker processes communicate with Spark executors
>>> via sockets instead of pipes.
>>>
>>> On Jul 7, 2016, at 15:49, Amit Rana <amitranavs...@gmail.com> wrote:
>>>
>>> Hi all,
>>>
>>> I am trying to trace the data flow in PySpark. I am using IntelliJ IDEA
>>> on Windows 7.
>>> I submitted a Python job as follows:
>>> --master local[4] <path to pyspark job> <arguments to the job>
>>>
>>> I made the following observations after running the above command in
>>> debug mode:
>>> -> Locally, when the PySpark interpreter starts, it also starts a JVM with
>>> which it communicates through a socket.
>>> -> py4j is used to handle this communication.
>>> -> This JVM acts as the actual Spark driver and loads a JavaSparkContext,
>>> which communicates with the Spark executors in the cluster.
>>>
>>> I have read that, in the cluster, the data flow between Spark executors and
>>> the Python interpreter happens using pipes, but I am not able to trace that
>>> data flow.
>>>
>>> Please correct me if my understanding is wrong. It would be very helpful
>>> if someone could help me understand the code flow for data transfer between
>>> the JVM and Python workers.
>>>
>>> Thanks,
>>> Amit Rana