Dear List,

Has anybody had experience integrating C/C++ code into Spark jobs?  

I have done some work on this topic using JNA.  I wrote a FlatMapFunction
that processes all of a partition's entries using a C++ library (a rough
sketch of the pattern follows the list below).  This approach works well,
but there are some tradeoffs:
 * Shipping the native dylib with the app jar and loading it at runtime
requires a bit of work (on top of normal JNA usage)
 * Native code doesn't respect the executor heap limits.  Under heavy memory
pressure, the native code can sporadically fail with ENOMEM.
 * While JNA can map Strings, structs, and Java primitive types, the user
still needs to handle more complex objects themselves, e.g. by re-serializing
protobuf/thrift objects or providing some other encoding for moving data
between Java and C/C++.
 * C++ static local initialization is not thread-safe before C++11, so the
user sometimes needs to take extra care when running inside multi-threaded
executors
 * Avoiding memory copies can be a little tricky
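
For concreteness, here is a rough sketch of the JNA pattern described above,
against the Spark 1.x Java API (the library name "mylib", the process() entry
point, and the String-in/String-out signature are all placeholders for the
real interface):

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import com.sun.jna.Library;
import com.sun.jna.Native;

import org.apache.spark.api.java.function.FlatMapFunction;

// Hypothetical example; used as rdd.mapPartitions(new NativeProcessor()).
public class NativeProcessor implements FlatMapFunction<Iterator<String>, String> {

  // JNA proxy for the native library.  The .so/.dylib ships inside the app
  // jar and gets extracted/loaded on the executor at first use (this is the
  // "bit of work" mentioned above).
  public interface MyLib extends Library {
    String process(String input);  // maps a const char* in/out on the C side
  }

  // One handle per executor JVM, loaded lazily so the driver never needs the
  // native library on its own machine.
  private static MyLib LIB;

  private static synchronized MyLib lib() {
    if (LIB == null) {
      LIB = (MyLib) Native.loadLibrary("mylib", MyLib.class);
    }
    return LIB;
  }

  @Override
  public Iterable<String> call(Iterator<String> partition) {
    List<String> results = new ArrayList<String>();
    while (partition.hasNext()) {
      results.add(lib().process(partition.next()));
    }
    return results;
  }
}

The synchronized lazy load is also where the pre-C++11 thread-safety caveat
bites, since several task threads in one executor can hit the library's own
statics at once.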

One other alternative that comes to mind is pipe().  However, PipedRDD
requires copying data over pipes, does not (as far as I can tell) support
binary data, and native code errors that crash the subprocess don't bubble
up to the Spark job as nicely as with JNA.  A sketch of that route follows.
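
For comparison, the pipe() route looks roughly like this (PipeExample and
my_native_filter are placeholders; the binary has to be shipped to each
worker, e.g. with SparkContext.addFile or spark-submit --files, and it
reads/writes newline-delimited text on stdin/stdout, which is where the
copying and text-only limitations come from):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

// Hypothetical driver program; "./my_native_filter" must be present in each
// task's working directory.
public class PipeExample {
  public static void main(String[] args) {
    JavaSparkContext sc =
        new JavaSparkContext(new SparkConf().setAppName("pipe-example"));

    JavaRDD<String> input = sc.textFile("hdfs:///path/to/input");

    // Each partition is streamed through its own subprocess; records are
    // serialized to text, copied over the pipe, and parsed back on return.
    JavaRDD<String> output = input.pipe("./my_native_filter");

    output.saveAsTextFile("hdfs:///path/to/output");
    sc.stop();
  }
}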

Is there a way to expose raw, in-memory partition/block data to native code?

Has anybody else attacked this problem a different way?

All the best,
-Paul 


