Hi,

Thank you for both responses.
Sun, you pointed out the exact issue I was referring to, which is
copying, serializing, and deserializing the byte array between the JVM heap
and the worker memory.
It also isn't clear to me why the byte array should be kept on-heap, since
the parent partition's data is just a byte array that only makes sense to a
Python environment.
Shouldn't we be writing the byte array off-heap and providing supporting
interfaces for outside processes to read and interact with the data?
I'm probably oversimplifying what is really required to do this.
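
On the worker side, something like the sketch below is roughly what I'm
picturing (the file path, and the assumption that the JVM side writes the
partition's serialized bytes into a memory-mapped file, are made up purely
to illustrate the idea):

    import mmap

    # Hypothetical: the JVM has written a partition's serialized bytes into
    # a memory-mapped file instead of streaming them over a socket.
    PARTITION_FILE = "/tmp/spark-partition-0.bin"

    with open(PARTITION_FILE, "rb") as f:
        buf = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        view = memoryview(buf)  # zero-copy view over the mapped bytes
        # a Python worker could deserialize its partition directly from 'view'
        view.release()
        buf.close()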

There is a recent JIRA that I thought was interesting with respect to our
discussion:
https://issues.apache.org/jira/browse/SPARK-10399

There's also a suggestion at the bottom of the JIRA about exposing on-heap
memory, which is pretty interesting.

- Rahul Palamuttam


On Wed, Sep 9, 2015 at 4:52 AM, Sun, Rui <rui....@intel.com> wrote:

> Hi, Rahul,
>
> Supporting a new language other than Java/Scala in Spark is different for
> the RDD API and the DataFrame API.
>
> For RDD API:
>
> An RDD is a distributed collection of language-specific data types whose
> representation is unknown to the JVM. Also, the transformation functions for
> an RDD are written in that language and can't be executed on the JVM. That's
> why worker processes of the language runtime are needed in such cases.
> Generally, to support the RDD API in a language, a subclass of the Scala RDD
> is needed on the JVM side (for example, PythonRDD for Python, RRDD for R)
> where compute() is overridden to send the serialized parent partition data
> (yes, the data copy you mentioned happens here) and the serialized
> transformation function via socket to the worker process. The worker process
> deserializes the partition data and the transformation function, then
> applies the function to the data. The result is serialized as a byte array
> and sent back to the JVM via socket. From the JVM's viewpoint, the resulting
> RDD is a collection of byte arrays.
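>
> For illustration, a rough PySpark sketch of that round trip (only ordinary
> public API is used; the comments just restate where the serialization and
> copying happen):
>
>     from pyspark import SparkContext
>
>     sc = SparkContext(appName="rdd-roundtrip-sketch")
>
>     # The lambda below is pickled on the driver and shipped to the Python
>     # worker processes, together with the serialized parent partition data.
>     doubled = sc.parallelize(range(10)).map(lambda x: x * 2)
>
>     # The workers return pickled byte arrays, so from the JVM's viewpoint
>     # each partition of 'doubled' is just a collection of bytes.
>     print(doubled.collect())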
>
> Performance is a concern in such cases, as there are overheads like the
> launching of worker processes, serialization/deserialization of partition
> data, and the bi-directional communication cost of moving the data.
> Besides, as the JVM can't know the real representation of the data in the
> RDD, it is difficult and complex to support shuffle and aggregation
> operations. Spark Core's built-in aggregator and shuffle can't be utilized
> directly; a language-specific implementation is needed to support these
> operations, which causes additional overhead.
>
> Additional memory occupation by the worker processes is also a concern.
>
> For DataFrame API:
>
> Things are much simpler than with the RDD API. For a DataFrame, data is read
> via the Data Source API and is represented as native objects within the JVM,
> and there are no language-specific transformation functions. Basically, the
> DataFrame API in the language is just a set of method wrappers around the
> corresponding methods of the Scala DataFrame API.
>
> Performance is not a concern. The computation is done on native objects in
> the JVM, so virtually no performance is lost.
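>
> For example, a small PySpark sketch (the data and column name are made up;
> the point is only that the Python DataFrame is a thin wrapper around a JVM
> object):
>
>     from pyspark import SparkContext
>     from pyspark.sql import SQLContext
>
>     sc = SparkContext(appName="dataframe-wrapper-sketch")
>     sqlContext = SQLContext(sc)
>
>     df = sqlContext.createDataFrame([(i,) for i in range(100)], ["value"])
>
>     # filter() and count() only build and execute a plan on the JVM side;
>     # the Python object wraps a JVM DataFrame (reachable as df._jdf), so no
>     # row data is copied into Python for this computation.
>     print(df.filter(df.value % 2 == 0).count())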
>
> The only exception is a UDF on a DataFrame. A UDF has to rely on the
> language worker processes, similar to the RDD API.
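>
> A sketch of the difference, reusing the df from the snippet above (the UDF
> and column names are made up):
>
>     from pyspark.sql.functions import udf
>     from pyspark.sql.types import IntegerType
>
>     # A Python UDF: each row's value is serialized to a Python worker, the
>     # function is applied there, and the result is serialized back, so the
>     # per-row round trip from the RDD case reappears.
>     plus_one = udf(lambda v: v + 1, IntegerType())
>     with_udf = df.withColumn("plus_one", plus_one(df.value))
>
>     # A built-in expression, by contrast, stays entirely inside the JVM.
>     with_builtin = df.withColumn("plus_one", df.value + 1)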
>
> -----Original Message-----
> From: Rahul Palamuttam [mailto:rahulpala...@gmail.com]
> Sent: Tuesday, September 8, 2015 10:54 AM
> To: user@spark.apache.org
> Subject: Support of other languages?
>
> Hi,
> I wanted to know more about how Spark supports R and Python, with respect
> to what gets copied into the language environments.
>
> To clarify :
>
> I know that PySpark utilizes py4j sockets to pass pickled Python functions
> between the JVM and the Python daemons. However, I wanted to know how it
> passes the data from the JVM into the daemon environment. I assume it has
> to copy the data over into the new environment, since Python can't exactly
> operate in JVM heap space (or can it?).
>
> I had the same question with respect to SparkR, though I'm not completely
> familiar with how they pass around native R code through the worker JVMs.
>
> The primary question I wanted to ask is: does Spark make a second copy of
> the data so that language-specific daemons can operate on it? What are some
> of the other limitations encountered when we try to offer multi-language
> support, whether in performance or in general software architecture?
> With Python in particular, the output of the collect operation must first be
> written to disk and then read back by the Python driver process.
>
> Would appreciate any insight on this, and if there is any work happening
> in this area.
>
> Thank you,
>
> Rahul Palamuttam
>
>
>
