Hi,

Thank you for both responses. Sun, you pointed out the exact issue I was referring to: the copying, serializing, and deserializing of the byte array between the JVM heap and the worker memory. It also isn't clear to me why the byte array should be kept on-heap, since the data of the parent partition is just a byte array that only makes sense to a Python environment. Shouldn't we write the byte array off-heap and provide supporting interfaces for outside processes to read and interact with the data? I'm probably oversimplifying what is really required to do this.
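To make the idea concrete, here is a minimal sketch of the zero-copy handoff I'm imagining. This is not how Spark works today; it just uses Python 3.8's standard multiprocessing.shared_memory module, and the region name "part_0" and the single-partition flow are invented purely for illustration:

    # Sketch only: a producer publishes partition bytes in OS shared
    # memory; a consumer attaches by name and reads them in place,
    # with no serialize/deserialize round trip between the two runtimes.
    from multiprocessing import shared_memory

    partition_bytes = b"opaque-partition-payload"  # stand-in data

    # Producer side (the "JVM" in the Spark analogy): write once, off-heap.
    shm = shared_memory.SharedMemory(create=True, name="part_0",
                                     size=len(partition_bytes))
    shm.buf[:len(partition_bytes)] = partition_bytes

    # Consumer side (the language worker): attach to the same region.
    view = shared_memory.SharedMemory(name="part_0")
    data = bytes(view.buf[:len(partition_bytes)])

    view.close()
    shm.close()
    shm.unlink()  # free the region once both sides are done

The payoff would be that the worker reads the partition through a mapping of the same physical pages, rather than receiving its own serialized copy over a socket.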
There is a recent JIRA which I thought was interesting with respect to our discussion: https://issues.apache.org/jira/browse/SPARK-10399

There is also a suggestion, at the bottom of that JIRA, that considers exposing on-heap memory, which is pretty interesting.

- Rahul Palamuttam

On Wed, Sep 9, 2015 at 4:52 AM, Sun, Rui <rui....@intel.com> wrote:
> Hi, Rahul,
>
> Supporting a new language other than Java/Scala in Spark differs between
> the RDD API and the DataFrame API.
>
> For the RDD API:
>
> An RDD is a distributed collection of language-specific data types whose
> representation is unknown to the JVM. The transformation functions for the
> RDD are also written in that language and can't be executed on the JVM.
> That's why worker processes of the language runtime are needed in this
> case. Generally, to support the RDD API in a language, a subclass of the
> Scala RDD is needed on the JVM side (for example, PythonRDD for Python,
> RRDD for R) where compute() is overridden to send the serialized parent
> partition data (yes, the data copy you mean happens here) and the
> serialized transformation function via socket to the worker process. The
> worker process deserializes the partition data and the transformation
> function, then applies the function to the data. The result is serialized
> as a byte array and sent back to the JVM via socket. From the JVM's
> viewpoint, the resulting RDD is a collection of byte arrays.
>
> Performance is a concern in this case, as there are overheads: launching
> the worker processes, serialization/deserialization of partition data, and
> the bi-directional communication cost of the data.
>
> Besides, as the JVM can't know the real representation of the data in the
> RDD, it is difficult and complex to support shuffle and aggregation
> operations. Spark Core's built-in aggregator and shuffle can't be used
> directly; a language-specific implementation is needed to support these
> operations, which causes additional overhead.
>
> Additional memory occupation by the worker processes is also a concern.
>
> For the DataFrame API:
>
> Things are much simpler than with the RDD API. For a DataFrame, data is
> read through the Data Source API and is represented as native objects
> within the JVM, and there are no language-specific transformation
> functions. Basically, the DataFrame API in the language consists of method
> wrappers around the corresponding ones in the Scala DataFrame API.
>
> Performance is not a concern. The computation is done on native objects
> in the JVM, with virtually no performance loss.
>
> The only exception is UDFs in DataFrames. A UDF has to rely on language
> worker processes, similar to the RDD API.
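(To restate the above in code for my own understanding -- a rough sketch against the current pyspark.sql entry points, with toy data invented for the example:)

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("lang-support-demo").getOrCreate()
    sc = spark.sparkContext

    # RDD API: the lambda is pickled and shipped to Python workers; each
    # parent partition is serialized out of the JVM, transformed in the
    # worker, and handed back to the JVM as opaque byte arrays.
    squares = sc.parallelize(range(10)).map(lambda x: x * x).collect()

    # DataFrame API: upper() is a thin wrapper over the JVM expression,
    # so the work happens on native objects inside the JVM -- no workers.
    df = spark.createDataFrame([("alice",), ("bob",)], ["name"])
    df.select(F.upper(df["name"]).alias("name")).show()

    # The UDF exception: wrapping a Python function reintroduces the
    # worker round trip, with the same serialization cost as the RDD path.
    shout = F.udf(lambda s: s.upper() + "!")
    df.select(shout(df["name"])).show()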
>
> -----Original Message-----
> From: Rahul Palamuttam [mailto:rahulpala...@gmail.com]
> Sent: Tuesday, September 8, 2015 10:54 AM
> To: user@spark.apache.org
> Subject: Support of other languages?
>
> Hi,
>
> I wanted to know more about how Spark supports R and Python, with respect
> to what gets copied into the language environments.
>
> To clarify:
>
> I know that PySpark utilizes py4j sockets to pass pickled Python
> functions between the JVM and the Python daemons. However, I wanted to
> know how it passes the data from the JVM into the daemon environment. I
> assume it has to copy the data over into the new environment, since
> Python can't exactly operate in JVM heap space (or can it?).
>
> I had the same question with respect to SparkR, though I'm not completely
> familiar with how it passes native R code through the worker JVMs.
>
> The primary question I wanted to ask is: does Spark make a second copy of
> the data, so that language-specific daemons can operate on it? What are
> some of the other limitations encountered when we try to offer
> multi-language support, whether in performance or in general software
> architecture? With Python in particular, the result of a collect operation
> must first be written to disk and then read back by the Python driver
> process.
>
> Would appreciate any insight on this, and whether there is any work
> happening in this area.
>
> Thank you,
>
> Rahul Palamuttam
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Support-of-other-languages-tp24599.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
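P.S. For anyone following along, here is a toy, self-contained mock-up of the disk round trip mentioned above for collect() -- plain pickle and a temp file, not the actual PySpark code path, which differs in its details:

    # Illustrative only: the "JVM side" spills pickled rows to a temp
    # file, and the "Python driver side" streams them back and rebuilds
    # the objects -- a second full pass over the collected data.
    import os
    import pickle
    import tempfile

    rows = [(i, i * i) for i in range(5)]  # stand-in collected result

    # "JVM side": write the serialized rows out to disk.
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        for row in rows:
            pickle.dump(row, f)

    # "Python driver side": read the file back and unpickle each row.
    collected = []
    with open(path, "rb") as f:
        while True:
            try:
                collected.append(pickle.load(f))
            except EOFError:
                break
    os.remove(path)
    print(collected)  # [(0, 0), (1, 1), (2, 4), (3, 9), (4, 16)]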