Hi, sharing what I discovered with PySpark as well; it corroborates what Amit noticed, and I am also interested in the pipe question: https://mail-archives.apache.org/mod_mbox/spark-dev/201603.mbox/%3c201603291521.u2tflbfo024...@d06av05.portsmouth.uk.ibm.com%3E
    // Start a thread to feed the process input from our parent's iterator
    val writerThread = new WriterThread(env, worker, inputIterator, partitionIndex, context)
    ...
    // Return an iterator that read lines from the process's stdout
    val stream = new DataInputStream(new BufferedInputStream(worker.getInputStream, bufferSize))

The above code and what follows look to be the important parts. Note that Josh Rosen replied to my comment with more information:

"One clarification: there are Python interpreters running on executors so that Python UDFs and RDD API code can be executed. Some slightly-outdated but mostly-correct reference material for this can be found at https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals. See also: search the Spark codebase for PythonRDD and look at python/pyspark/worker.py"

(I have bottom-posted two toy sketches of these mechanisms, below the quoted thread.)

From: Reynold Xin <r...@databricks.com>
To: Amit Rana <amitranavs...@gmail.com>
Cc: "dev@spark.apache.org" <dev@spark.apache.org>
Date: 08/07/2016 07:03
Subject: Re: Understanding pyspark data flow on worker nodes

You can look into its source code:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala

On Thu, Jul 7, 2016 at 11:01 PM, Amit Rana <amitranavs...@gmail.com> wrote:

Hi all,
Did anyone get a chance to look into it? Any sort of guidance will be much appreciated.
Thanks,
Amit Rana

On 7 Jul 2016 14:28, "Amit Rana" <amitranavs...@gmail.com> wrote:

As mentioned in the documentation, PythonRDD objects launch Python subprocesses and communicate with them using pipes, sending the user's code and the data to be processed. I am trying to understand how this data transfer over pipes is implemented. Can anyone please guide me along that line?
Thanks,
Amit Rana

On 7 Jul 2016 13:44, "Sun Rui" <sunrise_...@163.com> wrote:

You can read https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals
For PySpark data flow on worker nodes, you can read the source code of PythonRDD.scala. Note that Python worker processes communicate with Spark executors via sockets rather than pipes.

On Jul 7, 2016, at 15:49, Amit Rana <amitranavs...@gmail.com> wrote:

Hi all,
I am trying to trace the data flow in PySpark, using IntelliJ IDEA on Windows 7. I submitted a Python job as follows:
    --master local[4] <path to pyspark job> <arguments to the job>
Running the above command in debug mode gave me the following picture:
-> Locally, when the PySpark interpreter starts, it also starts a JVM with which it communicates over a socket.
-> Py4J is used to handle this communication.
-> This JVM acts as the actual Spark driver and loads a JavaSparkContext, which communicates with the Spark executors in the cluster.
I have read that, in the cluster, the data flow between the Spark executors and the Python interpreters happens over pipes, but I am not able to trace that data flow.
Please correct me if my understanding is wrong. It would be very helpful if someone could help me understand the code flow for data transfer between the JVM and the Python workers.
Thanks,
Amit Rana
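
First, the executor side. To make the WriterThread / stdout-iterator pattern from the quoted PythonRDD snippet concrete, here is a toy, self-contained Python sketch of the same producer/consumer idea: the parent launches a worker subprocess, a background thread feeds records down one pipe, and the parent iterates over results coming back on the other. This is only an illustration of the pattern, not Spark's actual code, and all names in it (run_worker, feed, the inline worker program) are made up for the example:

    # Toy illustration (not Spark code): parent <-> worker over OS pipes,
    # mirroring PythonRDD's WriterThread + stdout-iterator structure.
    import subprocess
    import sys
    import threading

    def run_worker(records):
        # Hypothetical worker: reads lines from stdin, writes transformed lines to stdout.
        worker = subprocess.Popen(
            [sys.executable, "-c",
             "import sys\n"
             "for line in sys.stdin:\n"
             "    sys.stdout.write(line.upper())\n"
             "    sys.stdout.flush()\n"],
            stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True)

        def feed():
            # Plays the role of PythonRDD's WriterThread: push the partition's
            # records into the worker, then close stdin so the worker can finish.
            for rec in records:
                worker.stdin.write(rec + "\n")
            worker.stdin.close()

        threading.Thread(target=feed, daemon=True).start()

        # Plays the role of the iterator built on the worker's output stream.
        for line in worker.stdout:
            yield line.rstrip("\n")
        worker.wait()

    if __name__ == "__main__":
        print(list(run_worker(["spark", "pyspark", "py4j"])))

In Spark itself the records are length-prefixed, pickled bytes rather than text lines, and (as Sun Rui points out above) the data channel between executor and Python worker is a local socket rather than stdin/stdout, but the writer-thread-plus-result-iterator structure is the same.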
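Second, the driver side that Amit describes (the PySpark interpreter talking to its JVM over a socket via Py4J). Below is a minimal Py4J example; it assumes a JVM with a py4j GatewayServer is already listening on the default port, whereas PySpark itself launches the JVM via spark-submit and wires up the gateway in python/pyspark/java_gateway.py:

    # Minimal Py4J sketch (assumes a GatewayServer is already listening on the
    # default port 25333; PySpark sets this connection up for you).
    from py4j.java_gateway import JavaGateway

    gateway = JavaGateway()                       # connect to the JVM over a local socket
    jvm_list = gateway.jvm.java.util.ArrayList()  # instantiate a Java object inside the JVM
    jvm_list.add("driver-side call via Py4J")
    print(jvm_list)                               # the Python side only holds a proxy

This is the mechanism behind Amit's observation: the Python SparkContext is largely a thin wrapper that forwards calls through such a gateway to the JVM's JavaSparkContext, while the bulk data transfer on worker nodes goes through the PythonRDD/worker.py path sketched above.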