Hi Rahul,

Supporting a new language other than Java/Scala in Spark works differently for 
the RDD API and the DataFrame API.

For the RDD API:

An RDD is a distributed collection of language-specific data types whose 
representation is unknown to the JVM. The transformation functions for such an 
RDD are also written in that language and can't be executed on the JVM. That's 
why worker processes of the language runtime are needed in this case. 
Generally, to support the RDD API in a new language, a subclass of the Scala 
RDD is needed on the JVM side (for example, PythonRDD for Python, RRDD for R) 
whose compute() is overridden to send the serialized parent partition data 
(yes, the data copy you asked about happens here) and the serialized 
transformation function to the worker process via a socket. The worker process 
deserializes the partition data and the transformation function, applies the 
function to the data, serializes the result as byte arrays, and sends it back 
to the JVM over the socket. From the JVM's viewpoint, the resulting RDD is a 
collection of byte arrays.
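
As a minimal PySpark sketch of this path (the doubling lambda is just an 
arbitrary illustration):

    from pyspark import SparkContext

    sc = SparkContext(appName="rdd-demo")
    # The lambda below is pickled on the driver, shipped to the Python
    # worker processes, and applied there; the JVM only ever sees the
    # resulting pickled byte arrays.
    rdd = sc.parallelize(range(10)).map(lambda x: x * 2)
    print(rdd.collect())  # results travel back to the JVM as pickled bytes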

Performance is a concern in such cases, as there are several overheads: 
launching the worker processes, serializing/deserializing the partition data, 
and the cost of the bi-directional communication.
Besides, since the JVM can't know the real representation of the data in the 
RDD, it is difficult and complex to support shuffle and aggregation 
operations. Spark Core's built-in aggregator and shuffle can't be used 
directly; language-specific implementations of these operations are needed, 
which causes additional overhead.
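
For instance, in a key-count aggregation like the sketch below (the word data 
is made up, and the SparkContext `sc` from the previous sketch is assumed), 
reduceByKey has to partition by the Python hash of the key and run the 
combiner inside the Python workers, while the JVM shuffle just moves opaque 
pickled byte arrays between executors:

    pairs = sc.parallelize(["a", "b", "a", "c", "a"]).map(lambda w: (w, 1))
    # The combine step runs in the Python workers; the JVM can't use its
    # built-in aggregator because it can't interpret the pickled records.
    counts = pairs.reduceByKey(lambda x, y: x + y)
    print(counts.collect())  # e.g. [('a', 3), ('b', 1), ('c', 1)]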

The additional memory occupied by the worker processes is also a concern.

For the DataFrame API:

Things are much simpler than with the RDD API. For a DataFrame, data is read 
through the Data Source API and is represented as native objects within the 
JVM, and there are no language-specific transformation functions. Basically, 
the DataFrame API in the new language is just a set of method wrappers around 
the corresponding Scala DataFrame API.
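
A sketch of what this looks like from Python (assuming an existing 
SparkSession named `spark` and a hypothetical people.json file):

    from pyspark.sql import functions as F

    # Each call below is forwarded through Py4J to the corresponding
    # Scala DataFrame method; the data itself never leaves the JVM.
    df = spark.read.json("people.json")
    adults_by_city = df.filter(F.col("age") > 21).groupBy("city").count()
    adults_by_city.show()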

Performance is not a concern here. The computation is done on native objects 
in the JVM, with virtually no performance loss.

The only exception is UDFs in DataFrames: a UDF has to rely on the language 
worker processes, similar to the RDD API.
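
For example (a sketch reusing the `df` from above; the doubling UDF is 
arbitrary):

    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType

    # The Python function is pickled and executed in Python worker
    # processes; the column values are serialized back and forth between
    # the JVM and the workers, just like with the RDD API.
    double = udf(lambda x: x * 2, IntegerType())
    df.withColumn("age_doubled", double(df["age"])).show()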

-----Original Message-----
From: Rahul Palamuttam [mailto:rahulpala...@gmail.com] 
Sent: Tuesday, September 8, 2015 10:54 AM
To: user@spark.apache.org
Subject: Support of other languages?

Hi,
I wanted to know more about how Spark supports R and Python, with respect to 
what gets copied into the language environments.

To clarify :

I know that PySpark utilizes Py4J sockets to pass pickled Python functions 
between the JVM and the Python daemons. However, I wanted to know how it 
passes the data from the JVM into the daemon environment. I assume it has to 
copy the data over into the new environment, since Python can't exactly 
operate in JVM heap space (or can it?).

I had the same question with respect to SparkR, though I'm not completely 
familiar with how it passes native R code through the worker JVMs.

The primary question I wanted to ask is: does Spark make a second copy of the 
data so that language-specific daemons can operate on it? What are some of the 
other limitations encountered when we try to offer multi-language support, 
whether in performance or in general software architecture?
With Python in particular, the result of a collect operation must first be 
written to disk and then read back by the Python driver process.

Would appreciate any insight on this, and on whether there is any work 
happening in this area.

Thank you,

Rahul Palamuttam  


