RE: Support of other languages?

2015-09-22 Thread Sun, Rui
Although the data is RDD[Array[Byte]], whose content is not meaningful to Spark
Core, it has to be on heap, as Spark Core manipulates the RDD elements as
on-heap JVM objects when applying RDD transformations.

SPARK-10399 is irrelevant. It aims to manipulate off-heap data using C++
libraries via JNI. This is done in-process.


Re: Support of other languages?

2015-09-17 Thread Rahul Palamuttam
Hi,

Thank you for both responses.
Sun, you pointed out the exact issue I was referring to, which is the
copying, serializing, and deserializing of the byte array between the JVM heap
and the worker memory.
It also isn't clear to me why the byte array should be kept on-heap, since
the data of the parent partition is just a byte array that only makes sense
to a Python environment.
Shouldn't we be writing the byte array off-heap and providing supporting
interfaces for outside processes to read and interact with the data?
I'm probably oversimplifying what is really required to do this.
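For example (purely illustrative, not how Spark works today; the file path and
sizes below are made up), one could picture the JVM writing the partition bytes
into a memory-mapped file, which lives outside the JVM heap and which an
external worker process could map as well:

    import java.nio.channels.FileChannel
    import java.nio.file.{Paths, StandardOpenOption}

    object OffHeapPartitionWriter {
      def main(args: Array[String]): Unit = {
        val partitionBytes: Array[Byte] = Array.fill(1024)(42.toByte) // stand-in data

        // Memory-mapped file: the mapped region is off-heap, and another
        // process (e.g. a Python worker) could mmap the same file to read the
        // bytes without copying them through the JVM heap.
        val channel = FileChannel.open(
          Paths.get("/tmp/partition-0.bin"),
          StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)
        val buffer = channel.map(FileChannel.MapMode.READ_WRITE, 0, partitionBytes.length)
        buffer.put(partitionBytes)
        buffer.force() // flush the mapped region to the backing file
        channel.close()
      }
    }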

There is a recent JIRA which I thought was interesting with respect to our
discussion:
https://issues.apache.org/jira/browse/SPARK-10399

There's also a suggestion at the bottom of the JIRA that considers
exposing on-heap memory, which is pretty interesting.

- Rahul Palamuttam


RE: Support of other languages?

2015-09-09 Thread Sun, Rui
Hi, Rahul,

To support a new language other than Java/Scala in Spark, the situation differs
between the RDD API and the DataFrame API.

For RDD API:

An RDD is a distributed collection of language-specific data types whose
representation is unknown to the JVM. The transformation functions for such an
RDD are also written in that language and cannot be executed on the JVM. That's
why worker processes of the language runtime are needed in such a case.
Generally, to support the RDD API in the language, a subclass of the Scala RDD
is needed on the JVM side (for example, PythonRDD for Python, RRDD for R), where
compute() is overridden to send the serialized parent partition data (yes, the
data copy you mention happens here) and the serialized transformation function
via a socket to the worker process. The worker process deserializes the
partition data and the transformation function, then applies the function to
the data. The result is serialized as a byte array and sent back to the JVM via
the socket. From the JVM's viewpoint, the resulting RDD is a collection of byte
arrays.
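A heavily simplified sketch of that compute() pattern (this is not the real
PythonRDD/RRDD code; the class name, worker port, and framing protocol are
invented for illustration):

    import java.io.{DataInputStream, DataOutputStream}
    import java.net.Socket
    import org.apache.spark.{Partition, TaskContext}
    import org.apache.spark.rdd.RDD

    // Hypothetical "language RDD": streams the serialized parent partition to
    // an external worker process over a socket and reads back length-prefixed
    // byte arrays as the result (resource cleanup and error handling omitted).
    class LanguageRDD(parent: RDD[Array[Byte]],
                      serializedFunc: Array[Byte],
                      workerPort: Int)
      extends RDD[Array[Byte]](parent) {

      override protected def getPartitions: Array[Partition] = parent.partitions

      override def compute(split: Partition, context: TaskContext): Iterator[Array[Byte]] = {
        val socket = new Socket("localhost", workerPort)
        val out = new DataOutputStream(socket.getOutputStream)
        val in = new DataInputStream(socket.getInputStream)

        // 1. Send the serialized transformation function.
        out.writeInt(serializedFunc.length)
        out.write(serializedFunc)

        // 2. Send each element of the parent partition (this is the data copy).
        for (record <- parent.iterator(split, context)) {
          out.writeInt(record.length)
          out.write(record)
        }
        out.writeInt(-1) // end-of-data marker (made up for this sketch)
        out.flush()

        // 3. Read back the transformed records as opaque byte arrays.
        Iterator.continually(in.readInt()).takeWhile(_ >= 0).map { len =>
          val buf = new Array[Byte](len)
          in.readFully(buf)
          buf
        }
      }
    }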

Performance is a concern in this case, as there are overheads such as launching
the worker processes, serialization/deserialization of the partition data, and
the bi-directional communication cost of moving the data.
Besides, as the JVM can't know the real representation of the data in the RDD,
it is difficult and complex to support shuffle and aggregation operations. Spark
Core's built-in aggregator and shuffle can't be utilized directly; a
language-specific implementation is needed to support these operations, which
causes additional overhead.
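To illustrate the kind of language-specific support needed (just a sketch of
the general idea, not what Spark actually does): the language worker would
first have to emit each record as a serialized (key, value) pair of byte
arrays, and the JVM side could then only partition on the raw key bytes, for
example with a custom Partitioner:

    import java.util.Arrays
    import org.apache.spark.Partitioner
    import org.apache.spark.rdd.RDD

    // Partition opaque key bytes by their content hash, because the JVM
    // cannot deserialize the language-specific keys. (Illustrative only.)
    class OpaqueKeyPartitioner(parts: Int) extends Partitioner {
      override def numPartitions: Int = parts
      override def getPartition(key: Any): Int = {
        val h = Arrays.hashCode(key.asInstanceOf[Array[Byte]])
        ((h % parts) + parts) % parts // non-negative modulo
      }
    }

    object OpaqueShuffle {
      // `pairs` holds (serializedKey, serializedValue) records emitted by the
      // language worker.
      def partitionOpaque(pairs: RDD[(Array[Byte], Array[Byte])],
                          parts: Int): RDD[(Array[Byte], Array[Byte])] =
        // After partitionBy, all records with the same key bytes land in the
        // same partition, but the actual grouping/aggregation still has to
        // happen in the language worker: the JVM cannot even compare the keys
        // for equality (Array[Byte] equality is by reference), so Spark Core's
        // built-in aggregator cannot be reused.
        pairs.partitionBy(new OpaqueKeyPartitioner(parts))
    }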

The additional memory consumed by the worker processes is also a concern.

For DataFrame API:

Things are much simpler than with the RDD API. For a DataFrame, data is read via
the Data Source API and is represented as native objects within the JVM, and
there are no language-specific transformation functions. Basically, the
DataFrame API in the other language is just a set of method wrappers around the
corresponding Scala DataFrame API.

Performance is not a concern. The computation is done on native objects in the
JVM, with virtually no performance loss.

The only exception is UDFs on DataFrames. A UDF written in the other language
has to rely on the language worker processes, similar to the RDD API.
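As a concrete Scala-side illustration of the difference (the sample data and
column names below are made up): the column expressions run entirely inside the
JVM no matter which language built the query, while the equivalent Python or R
UDF would force the per-row round trip to a language worker described above:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.functions.{col, udf}

    object DataFrameVsUdf {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("df-vs-udf").setMaster("local[2]"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.implicits._

        val df = Seq(("a", 1), ("b", 2), ("c", 3)).toDF("name", "value")

        // Declarative DataFrame operations: executed on JVM-native rows with
        // no language worker process involved, regardless of whether the same
        // query was issued from Python or R.
        df.filter(col("value") > 1).groupBy("name").count().show()

        // A Scala UDF still runs inside the JVM. The point above is that the
        // equivalent Python/R UDF would instead ship each row out to a
        // language worker process and back.
        val doubled = udf((x: Int) => x * 2)
        df.select(col("name"), doubled(col("value")).as("doubled")).show()

        sc.stop()
      }
    }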


Re: Support of other languages?

2015-09-08 Thread Nagaraj Chandrashekar
Hi Rahul, 

I may not have the answer to exactly what you are looking for, but my thoughts
are given below.

I have worked with HP Vertica and R via UDFs (User Defined Functions). I don't
have any experience with SparkR so far, but I would expect it to follow a
similar route.

The UDFs reference external shared libraries that are responsible for some
analytics procedure. These procedures run as part of the Vertica process
context to process the data stored in its data structures. Badly written UDF
code can slow down the entire process.

You can refer to the following URLs for further reading on HP Vertica and R
integration.

https://my.vertica.com/docs/5.0/HTML/Master/15713.htm
https://www.vertica.com/tag/vertica-2/page/8/ (see the "A Deeper Dive on
Vertica and R" section)

Cheers
Nagaraj C

Learn And Share! It's Big Data.

Support of other languages?

2015-09-07 Thread Rahul Palamuttam
Hi, 
I wanted to know more about how Spark supports R and Python, with respect to
what gets copied into the language environments.

To clarify :

I know that PySpark utilizes Py4J sockets to pass pickled Python functions
between the JVM and the Python daemons. However, I wanted to know how it
passes the data from the JVM into the daemon environment. I assume it has to
copy the data over into the new environment, since Python can't exactly
operate in JVM heap space (or can it?).

I had the same question with respect to SparkR, though I'm not completely
familiar with how it passes native R code around through the worker JVMs.

The primary question I wanted to ask is: does Spark make a second copy of the
data so that language-specific daemons can operate on it? What are some of the
other limitations encountered when we try to offer multi-language support,
whether in performance or in general software architecture?
With Python in particular, the collect operation must first be written to
disk and then read back by the Python driver process.
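Roughly what I mean by that (a sketch of the mechanism as I understand it, not
PySpark's actual code; the file naming and framing are made up):

    import java.io.{BufferedOutputStream, DataOutputStream, File, FileOutputStream}

    // The JVM driver dumps the collected, already-serialized partition
    // elements into a temporary local file, and the Python driver process
    // then reads and unpickles that file.
    object CollectViaTempFile {
      def writeCollected(elements: Array[Array[Byte]]): File = {
        val tmp = File.createTempFile("collected-", ".bin")
        val out = new DataOutputStream(
          new BufferedOutputStream(new FileOutputStream(tmp)))
        try {
          for (bytes <- elements) {
            out.writeInt(bytes.length) // simple length-prefixed framing
            out.write(bytes)
          }
        } finally {
          out.close()
        }
        tmp // the path would be handed to the Python side, e.g. over Py4J
      }
    }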

I would appreciate any insight on this, and whether there is any work happening
in this area.

Thank you,

Rahul Palamuttam  


