[jira] [Commented] (SPARK-10399) Off Heap Memory Access for non-JVM libraries (C++)

2017-09-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16165055#comment-16165055
 ] 

Apache Spark commented on SPARK-10399:
--

User 'kiszk' has created a pull request for this issue:
https://github.com/apache/spark/pull/19222

> Off Heap Memory Access for non-JVM libraries (C++)
> --
>
> Key: SPARK-10399
> URL: https://issues.apache.org/jira/browse/SPARK-10399
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Reporter: Paul Weiss
>
> *Summary*
> Provide direct off-heap memory access to an external non-JVM program, such as
> a C++ library, within the running Spark JVM/executor.  As Spark moves to
> storing all data in off-heap memory, it makes sense to provide access points
> to that memory for non-JVM programs.
> 
> *Assumptions*
> * Zero copies will be made during the call into non-JVM library
> * Access into non-JVM libraries will be accomplished via JNI
> * A generic JNI interface will be created so that developers will not need to 
> deal with the raw JNI call
> * C++ will be the initial target non-JVM use case
> * Memory management will remain on the JVM/Spark side
> * The API from C++ will be similar to DataFrames as much as feasible and NOT
> require expert knowledge of JNI
> * Data organization and layout will support complex (multi-type, nested, 
> etc.) types
> 
> *Design*
> * Initially Spark JVM -> non-JVM will be supported 
> * Creating an embedded JVM with Spark running from a non-JVM program is 
> initially out of scope
> 
> *Technical*
> * GetDirectBufferAddress is the JNI call used to access byte buffer without 
> copy
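
A minimal Java-side sketch of the zero-copy idea in the Technical note above,
not code from this issue or from Spark. The JNI function
GetDirectBufferAddress only yields an address for direct buffers, so the JVM
side has to allocate with ByteBuffer.allocateDirect. The class name
DirectBufferSketch and the commented-out native method are hypothetical; the
native call is left as a comment so the example runs without a native library.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class DirectBufferSketch {
    // Hypothetical native entry point (an assumption, not part of Spark).
    // A C++ implementation would call env->GetDirectBufferAddress(buf) and
    // env->GetDirectBufferCapacity(buf) to read the memory without a copy.
    // static native long sum(ByteBuffer buf);

    /** Allocates off-heap memory and fills it with the ints 0..count-1. */
    static ByteBuffer fill(int count) {
        ByteBuffer buf = ByteBuffer.allocateDirect(count * Integer.BYTES)
                                   .order(ByteOrder.nativeOrder());
        for (int i = 0; i < count; i++) {
            buf.putInt(i);
        }
        buf.flip(); // position = 0, limit = number of bytes written
        return buf;
    }

    public static void main(String[] args) {
        ByteBuffer direct = fill(4);
        ByteBuffer heap = ByteBuffer.allocate(16);
        System.out.println("direct buffer isDirect: " + direct.isDirect());
        System.out.println("heap buffer isDirect:   " + heap.isDirect());
    }
}
```

Native byte order is set explicitly because C++ code reading the raw address
would interpret multi-byte values in the platform's order, while a new
ByteBuffer defaults to big-endian.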



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10399) Off Heap Memory Access for non-JVM libraries (C++)

2017-11-06 Thread Jim Pivarski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240304#comment-16240304
 ] 

Jim Pivarski commented on SPARK-10399:
--

Was this resolved as WontFix because PR 19222 has no conflicts and will be
merged, because off-heap memory will instead be exposed as Arrow buffers, or
because you don't intend to support this feature? I'm just asking for
clarification, not asking you to change your minds.





[jira] [Commented] (SPARK-10399) Off Heap Memory Access for non-JVM libraries (C++)

2017-11-06 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240314#comment-16240314
 ] 

Sean Owen commented on SPARK-10399:
---

I think this is mostly superseded by Arrow's intended role here, yeah.




[jira] [Commented] (SPARK-10399) Off Heap Memory Access for non-JVM libraries (C++)

2018-09-07 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-10399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16607532#comment-16607532
 ] 

Apache Spark commented on SPARK-10399:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/22361




[jira] [Commented] (SPARK-10399) Off Heap Memory Access for non-JVM libraries (C++)

2018-09-07 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-10399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16607533#comment-16607533
 ] 

Apache Spark commented on SPARK-10399:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/22361




[jira] [Commented] (SPARK-10399) Off Heap Memory Access for non-JVM libraries (C++)

2018-04-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16428000#comment-16428000
 ] 

Apache Spark commented on SPARK-10399:
--

User 'kiszk' has created a pull request for this issue:
https://github.com/apache/spark/pull/20991




[jira] [Commented] (SPARK-10399) Off Heap Memory Access for non-JVM libraries (C++)

2018-04-20 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16446166#comment-16446166
 ] 

Kazuaki Ishizaki commented on SPARK-10399:
--

https://issues.apache.org/jira/browse/SPARK-23879 is the follow-up JIRA entry.




[jira] [Commented] (SPARK-10399) Off Heap Memory Access for non-JVM libraries (C++)

2018-04-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16446176#comment-16446176
 ] 

Apache Spark commented on SPARK-10399:
--

User 'kiszk' has created a pull request for this issue:
https://github.com/apache/spark/pull/21117




[jira] [Commented] (SPARK-10399) Off Heap Memory Access for non-JVM libraries (C++)

2016-02-02 Thread Kent Yao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15129928#comment-15129928
 ] 

Kent Yao commented on SPARK-10399:
--

How is this work going?




[jira] [Commented] (SPARK-10399) Off Heap Memory Access for non-JVM libraries (C++)

2015-09-02 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726886#comment-14726886
 ] 

Saisai Shao commented on SPARK-10399:
-

Hi [~paulweiss], a simple question about letting a C++ library access Spark
memory: what are the benefit and the usage model of accessing Spark off-heap
memory from a C++ library? AFAIK Tungsten currently stores mainly shuffle and
SQL aggregation data (among others) in off-heap memory, so I'd like to know
what the best usage scenario for this C++ library would be.

Thanks a lot :).




[jira] [Commented] (SPARK-10399) Off Heap Memory Access for non-JVM libraries (C++)

2015-09-03 Thread Paul Weiss (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14729026#comment-14729026
 ] 

Paul Weiss commented on SPARK-10399:


One example, albeit contrived, is as follows:

You have a huge set of images that you want to filter, aggregate, and group
based on metadata associated with each image.  Spark will do a great job of
all this, but suppose you have a low-level C++ library that does sophisticated
image recognition.  Rewriting that library in another language is not
practical.  In addition, the images are very large, so copying them out of
process is also not an option because it would make your application too slow.
Having the ability to pass a reference to the off-heap memory (which would
represent an image in this case) to your special-sauce image recognition
library would be extremely beneficial, especially since it could run in
parallel across your cluster for different images.

The idea would be to reuse the Tungsten code to achieve this and also leverage
the flexible DataFrame API.  It would not be related to shuffles directly.




[jira] [Commented] (SPARK-10399) Off Heap Memory Access for non-JVM libraries (C++)

2015-09-06 Thread Paul Wais (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14733169#comment-14733169
 ] 

Paul Wais commented on SPARK-10399:
---

Image processing is a great use case.  I've deployed a JNA-based image 
processing Spark app on a cluster of ~200 cores and one of the pain points was 
memory management.  That solution copied images (via memcpy) since there was
no time to implement a better solution.  Spark would have the JVM use
essentially all available memory and would not account for native off-heap 
usage, so the native code would typically trigger an OOM after a while.  Tuning 
to curtail OOMs was hard.  Direct access to off-heap memory would have helped a 
ton here.

A similar use case is large-scale processing of text data (e.g. web pages, 
tweets, blog posts, etc).  java.lang.String is not very portable (noted below) 
and direct access to string buffers (especially if they're in proper UTF 
format) is very desirable.  Direct access to UTF-8 could also benefit Python 
support.

A major advantage of *in-process* native code (as opposed to, say, using 
`RDD.pipe()`) is that exceptions can get propagated, logged, and handled by 
Spark.  This feature alone IMO warrants the software cost of in-process native 
code.  Unfortunately, properly handling JNI-related exceptions and other 
nuances is tricky and a major pain.  I recommend Djinni, which helps a ton here 
(and is used in consumer mobile apps): https://bit.ly/djinnitalk   Furthermore, 
Djinni also recently added a type-marshaling feature that enables zero-copy 
type translation.  (The default type marshaling does deep copying.)

Some related issues:

 * Spark's BlockManager makes use of on-heap byte buffers for e.g. compression. 
 On-heap byte arrays are *not* necessarily zero-copy (the JVM is allowed to 
copy data in a JNI  `GetPrimitiveArrayCritical()` call; FMI see some discussion 
https://github.com/dropbox/djinni/issues/54 ).  A complete solution to this 
JIRA may necessitate some changes to Spark's core serializer API.  (In 
particular, it might be nice to have a code path that avoids any temporary 
on-heap buffers).

 * While Spark's Unsafe UTF-8 Strings are likely portable, java.lang.String is 
*not* particularly portable to C++: 
https://github.com/dropbox/djinni/blob/master/support-lib/jni/djinni_support.cpp#L431
  I've microbenchmarked that code and found it to be major overhead.  A 
solution to this JIRA might need some subtle API changes to encourage/help 
users avoid Java Strings.

 * Shipping and running a native library on a cluster is tricky.  Containers / 
virtualization (e.g. Docker) can help ensure the availability of dependencies, 
but sometimes those technologies aren't available.  One can compile all 
dependencies (i.e. including libc++) into a single dynamic library, but that
takes some special build set-up.  On-executor, dynamic code compilation (e.g. 
through Cling https://root.cern.ch/cling ) would be desirable but is probably 
beyond the scope of this JIRA.  I'm hoping to contribute a change to Djinni 
soon ( 
https://github.com/dropbox/djinni/compare/master...pwais:pwais_linux_build ) 
that will address the common use case where one simply wants to ship and run 
(on Spark) an app jar that contains a native library (and use system 
dependencies).
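
To make the string point above concrete, here is a small sketch (not from this
JIRA or from Spark) that encodes a java.lang.String as standard UTF-8 directly
into a direct buffer, so native code could read the bytes without converting
from Java's UTF-16 representation. The class name Utf8DirectSketch is
hypothetical. Note that JNI's own GetStringUTFChars returns *modified* UTF-8,
which is one reason to encode on the Java side instead.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

final class Utf8DirectSketch {
    /** Encodes text as standard UTF-8 into a newly allocated direct buffer. */
    static ByteBuffer encode(String text) {
        CharsetEncoder enc = StandardCharsets.UTF_8.newEncoder();
        // maxBytesPerChar() is an upper bound, so the buffer cannot overflow.
        int cap = (int) Math.ceil(text.length() * (double) enc.maxBytesPerChar());
        ByteBuffer out = ByteBuffer.allocateDirect(cap);
        CoderResult result = enc.encode(CharBuffer.wrap(text), out, true);
        if (result.isError()) {
            // For example an unpaired surrogate in the input.
            throw new IllegalArgumentException("not encodable: " + result);
        }
        enc.flush(out);
        out.flip(); // limit = number of UTF-8 bytes written
        return out;
    }

    public static void main(String[] args) {
        ByteBuffer utf8 = encode("h\u00e9llo");
        System.out.println("UTF-8 bytes: " + utf8.remaining());
    }
}
```

The encoder writes straight into the off-heap buffer, which avoids the
temporary on-heap byte[] that String.getBytes() would create.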



Are there any followers of this JIRA who have specific API requests?  My take 
on this issue is that there are a few main components:
  * Ensuring the accessibility of UnsafeRow to user code (which would then 
invoke native code).  (It's not clear to me that this is already part of Spark 
1.5; DataFrames simply interop with Row).  
  * Creating a byte buffer 'view' that's similar to UTF8String for buffer row
attributes.  `UnsafeRow.getBytes()` currently deep-copies (into an on-heap
array) and we'd want a 'view' of the bytes instead.
  * Defining and implementing core type mappers.  E.g. Spark UTF8String <->
std::string.  It might be nice for "Spark C++" types to be simple arrays (e.g. 
(pointer, length, nullable deleter)) with adapters to standard types (e.g. 
std::string and std::vector).  The deleter part is important if native code can 
be allowed to consume (and gain ownership) of data; a full solution needs a 
'move' API component.
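
The 'view' idea in the list above can be sketched with plain java.nio, under
the assumption that the row bytes are already reachable as a ByteBuffer. This
is an illustration only: BufferViewSketch and view() are hypothetical names,
not Spark APIs, and UnsafeRow is not involved.

```java
import java.nio.ByteBuffer;

final class BufferViewSketch {
    /**
     * Returns a zero-copy view of the absolute byte range
     * [offset, offset + length) that shares memory with the source buffer.
     */
    static ByteBuffer view(ByteBuffer source, int offset, int length) {
        ByteBuffer dup = source.duplicate(); // own position/limit, same bytes
        dup.clear();                         // position = 0, limit = capacity
        dup.limit(offset + length);          // throws if the range is out of bounds
        dup.position(offset);
        // slice() starts at the current position; its byte order is always
        // big-endian, so callers should set the order they need.
        return dup.slice();
    }

    public static void main(String[] args) {
        ByteBuffer source = ByteBuffer.allocateDirect(16);
        for (int i = 0; i < 16; i++) {
            source.put((byte) i);
        }
        ByteBuffer v = view(source, 4, 8);
        v.put(0, (byte) 99); // writes through to the source
        System.out.println("source[4] = " + source.get(4));
    }
}
```

Because the view of a direct buffer is itself direct, it could be handed to
native code in the same way as the original buffer.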

With those pieces in place (and especially if any "Spark C++ support code" is 
header-only), it wouldn't be too hard for users to build & package Spark jars 
w/ native libs as they please. As mentioned above, I'd recommend Djinni as a 
facilitator to this project (and as a facilitator to users who want to write & 
deploy native libs).

There are some other misc issues:
 * Is Unsafe memory always aligned? If not, how can we flag this to native code?
 * As mentioned above, can we modify BlockManager to have a path that skips any 
on-heap buffers?
 * If native code *does* need to use substantial memory, how can it communicate 
that

[jira] [Commented] (SPARK-10399) Off Heap Memory Access for non-JVM libraries (C++)

2015-09-14 Thread Paul Wais (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14744634#comment-14744634
 ] 

Paul Wais commented on SPARK-10399:
---

After investigating this issue a bit further, it might be feasible to expose 
*on*-heap Spark memory (without a copy) to native code through the 
{Get,Release}*Critical() JNI interface.  Android[1] uses this interface for 
copying on-heap data to devices (e.g. the GPU).  It's important to note that
the interface is not necessarily zero-copy and will cause some JVMs to block GC
(e.g. HotSpot [2]); could that lead to longer Spark GC pauses?  In any case, this
feature might help expose the individual elements of an RDD to native code 
without any major changes to Spark (e.g. to the BlockManager).

Nevertheless, native code would ideally not run a JNI call per-item (e.g. per 
row) and instead could get access to a segment of rows or an entire partition.  
However, blocking the GC while processing an entire partition would probably 
not work well in practice...

[1] 
https://github.com/android/platform_frameworks_base/search?p=3&q=GetPrimitiveArrayCritical&utf8=%E2%9C%93
[2] 
https://github.com/openjdk-mirror/jdk7u-hotspot/blob/50bdefc3afe944ca74c3093e7448d6b889cd20d1/src/share/vm/prims/jni.cpp#L4235
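
One way to picture the "segment of rows" idea above is to pack many rows into
a single direct buffer so that one native call could process the whole batch.
The sketch below is an assumption-laden illustration, not a Spark format: the
[int length][bytes] record layout and the names RowBatchSketch, pack and
countRecords are all hypothetical. The copy in pack() only builds sample data;
in the scenario discussed, rows would ideally already live off-heap.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;
import java.util.List;

final class RowBatchSketch {
    /** Packs rows as [int length][bytes] records into one direct buffer. */
    static ByteBuffer pack(List<byte[]> rows) {
        int size = 0;
        for (byte[] row : rows) {
            size += Integer.BYTES + row.length;
        }
        ByteBuffer batch = ByteBuffer.allocateDirect(size)
                                     .order(ByteOrder.nativeOrder());
        for (byte[] row : rows) {
            batch.putInt(row.length).put(row);
        }
        batch.flip();
        return batch;
    }

    /** Walks the batch the way a native consumer might; returns the row count. */
    static int countRecords(ByteBuffer batch) {
        // duplicate() resets the byte order to big-endian, so restore it.
        ByteBuffer cursor = batch.duplicate().order(batch.order());
        int count = 0;
        while (cursor.remaining() >= Integer.BYTES) {
            int length = cursor.getInt();
            cursor.position(cursor.position() + length); // skip the payload
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        ByteBuffer batch = pack(Arrays.asList(new byte[] {1, 2}, new byte[] {3}));
        System.out.println("records in batch: " + countRecords(batch));
    }
}
```

With a layout like this, the JNI crossing cost is paid once per batch rather
than once per row, and no critical section over on-heap arrays is needed.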




[jira] [Commented] (SPARK-10399) Off Heap Memory Access for non-JVM libraries (C++)

2016-03-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15177763#comment-15177763
 ] 

Apache Spark commented on SPARK-10399:
--

User 'yzotov' has created a pull request for this issue:
https://github.com/apache/spark/pull/11494
