Re: Native / C/C++ code integration

2014-11-11 Thread Paul Wais
More thoughts.  I took a deeper look at BlockManager, RDD, and friends.
Suppose one wanted to give native code access to raw, un-deserialized blocks.
This task looks very hard.  An RDD behaves much like a Scala iterator of
deserialized values, and interop with BlockManager happens entirely on
deserialized data.  One would probably need to rewrite much of RDD,
CacheManager, etc. in native code; an RDD subclass (e.g. in the style of
PythonRDD) probably wouldn't work.

So exposing raw blocks to native code looks intractable.  I wonder how fast
Java/Kryo can SerDe byte arrays.  E.g. suppose we have an RDD[T] where T
is immutable and most of the memory for a single T is a byte array.  What is
the overhead of SerDe-ing T?  (Does Java/Kryo copy the underlying memory?)
If the overhead is small, then native access to raw blocks wouldn't really
yield any advantage.
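As a small data point on the copy question: plain Java serialization (used
here as a stand-in for Kryo, which isn't on a stock classpath) does produce a
fresh copy of the underlying byte array on every round trip.  A minimal
sketch, with the hypothetical Record class standing in for T:

```java
import java.io.*;

public class SerDeCheck {
    // Immutable stand-in for T: most of its memory is one byte array.
    static final class Record implements Serializable {
        final byte[] payload;
        Record(byte[] payload) { this.payload = payload; }
    }

    static byte[] toBytes(Record r) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(r);
        }
        return bos.toByteArray();
    }

    static Record fromBytes(byte[] bytes)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois =
                 new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (Record) ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Record r = new Record(new byte[1 << 20]); // 1 MiB payload
        Record back = fromBytes(toBytes(r));
        // The deserialized payload is a fresh array: SerDe copies the memory.
        System.out.println("same length: "
                + (back.payload.length == r.payload.length));
        System.out.println("distinct copy: " + (back.payload != r.payload));
    }
}
```

Kryo may reduce framing overhead relative to java.io, but it also has to
materialize a new byte[] on deserialization, so the copy itself remains.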




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Native-C-C-code-integration-tp18347p18640.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Native / C/C++ code integration

2014-11-07 Thread Paul Wais
Dear List,

Has anybody had experience integrating C/C++ code into Spark jobs?  

I have done some work on this topic using JNA.  I wrote a FlatMapFunction
that processes all partition entries using a C++ library.  This approach
works well, but there are some tradeoffs:
 * Shipping the native dylib with the app jar and loading it at runtime
requires a bit of work (on top of normal JNA usage)
 * Native code doesn't respect the executor heap limits.  Under heavy memory
pressure, the native code can sporadically fail with ENOMEM.
 * While JNA can map Strings, structs, and Java primitive types, the user
still needs to handle more complex objects, e.g. by re-serializing
protobuf/thrift objects or providing some other encoding for moving data
between Java and C/C++.
 * C++ static initialization is not thread-safe before C++11, so the user
sometimes needs to take care when running inside multi-threaded executors.
 * Avoiding memory copies can be a little tricky
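The first tradeoff above (shipping the dylib inside the app jar) boils down
to extracting the bundled library to a temp file at runtime and System.load-ing
it.  A minimal sketch; the resource path "/native/libfoo.so" and the class
names are hypothetical, and the main method exercises only the extraction step
with stand-in bytes rather than a real library:

```java
import java.io.*;
import java.nio.file.*;

public class NativeLoader {
    // Copy a resource stream to a temp file so System.load() can find it.
    static Path extractToTemp(InputStream in, String suffix) throws IOException {
        Path tmp = Files.createTempFile("native-", suffix);
        tmp.toFile().deleteOnExit();
        Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        return tmp;
    }

    // Called once per executor JVM, e.g. from a static initializer in the
    // FlatMapFunction class, before any JNA bindings are touched.
    static void loadBundledLibrary() throws IOException {
        try (InputStream in =
                 NativeLoader.class.getResourceAsStream("/native/libfoo.so")) {
            if (in == null) {
                throw new FileNotFoundException("libfoo.so not bundled in jar");
            }
            System.load(extractToTemp(in, ".so").toAbsolutePath().toString());
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] fake = {0x7f, 'E', 'L', 'F'}; // stand-in bytes, not a real lib
        Path p = extractToTemp(new ByteArrayInputStream(fake), ".so");
        System.out.println("extracted bytes: " + Files.size(p));
    }
}
```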

One other alternative approach that comes to mind is pipe().  However,
PipedRDD requires copying data over pipes, does not appear to support binary
data, and native-code errors that crash the subprocess don't bubble up to the
Spark job as nicely as with JNA.
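For comparison, the pipe() approach amounts to the following kind of
subprocess round trip per partition (tr is used here as a stand-in for a
native binary): each element is copied across OS pipes as line-oriented text
on the way in and again on the way out.

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

public class PipeSketch {
    public static void main(String[] args) throws Exception {
        // Launch the external process, as PipedRDD does per partition.
        Process p = new ProcessBuilder("tr", "a-z", "A-Z").start();

        // Write each element to the subprocess's stdin as a text line
        // (one copy into the pipe), then close stdin to signal EOF.
        try (Writer w = new OutputStreamWriter(
                p.getOutputStream(), StandardCharsets.UTF_8)) {
            w.write("hello\nworld\n");
        }

        // Read transformed lines back from stdout (a second copy out).
        try (BufferedReader r = new BufferedReader(new InputStreamReader(
                p.getInputStream(), StandardCharsets.UTF_8))) {
            System.out.println(r.readLine());
            System.out.println(r.readLine());
        }
        p.waitFor();
    }
}
```

Note that if the subprocess segfaults mid-stream, all the driver sees is a
closed pipe and a nonzero exit code, which is the error-reporting weakness
mentioned above.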

Is there a way to expose raw, in-memory partition/block data to native code?

Has anybody else attacked this problem a different way?

All the best,
-Paul 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Native-C-C-code-integration-tp18347.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
