Dear List,

Has anybody had experience integrating C/C++ code into Spark jobs?
I have done some work on this topic using JNA. I wrote a FlatMapFunction that processes all partition entries using a C++ library. This approach works well, but there are some tradeoffs:

* Shipping the native dylib with the app jar and loading it at runtime requires a bit of work (on top of normal JNA usage).
* Native code doesn't respect the executor heap limits, so under heavy memory pressure native allocations can sporadically fail with ENOMEM.
* While JNA can map Strings, structs, and Java primitive types, the user still needs to deal with more complex objects, e.g. re-serialize protobuf/thrift objects or provide some other encoding for moving data between Java and C/C++.
* C++ static local initialization is not thread-safe before C++11, so the user sometimes needs to take care when running inside multi-threaded executors.
* Avoiding memory copies can be a little tricky.

One alternative approach that comes to mind is pipe(). However, PipedRDD requires copying data over pipes, does not support binary data (?), and native code errors that crash the subprocess don't bubble up to the Spark job as nicely as with JNA.

Is there a way to expose raw, in-memory partition/block data to native code? Has anybody else attacked this problem a different way? (I've put a few rough sketches of what I mean below my signature, in case that helps.)

All the best,
-Paul
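P.S. To make the discussion more concrete, here is roughly the shape of the JNA approach; it's a minimal sketch, not the code I actually run. NativeScorer, score(), and the "scorer" library name are placeholders, it assumes the .so/.dylib has already been extracted from the app jar to a directory on jna.library.path, and depending on your Spark/JNA versions the exact signatures differ slightly (Native.loadLibrary vs. Native.load, FlatMapFunction returning Iterable vs. Iterator).

import com.sun.jna.Library;
import com.sun.jna.Native;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.FlatMapFunction;

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class NativeScorerJob {

    // JNA interface mapped onto the C++ library's exported C function
    // (assumed C signature: double score(const char *record)).
    public interface NativeScorer extends Library {
        double score(String record);
    }

    // Loaded once per executor JVM and reused across partitions/tasks.
    private static NativeScorer scorer;

    private static synchronized NativeScorer scorer() {
        if (scorer == null) {
            // JNA resolves "scorer" to libscorer.so / libscorer.dylib via
            // jna.library.path; the copy shipped inside the app jar has to be
            // extracted to a local dir and that dir put on the path beforehand.
            scorer = (NativeScorer) Native.loadLibrary("scorer", NativeScorer.class);
        }
        return scorer;
    }

    public static JavaRDD<Double> scoreAll(JavaRDD<String> input) {
        return input.mapPartitions(
            (FlatMapFunction<Iterator<String>, Double>) records -> {
                NativeScorer lib = scorer();
                List<Double> out = new ArrayList<>();
                while (records.hasNext()) {
                    out.add(lib.score(records.next()));  // one JNA call per record
                }
                return out;  // returned as an Iterable here (older FlatMapFunction contract)
            });
    }
}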
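The complex-object point above usually ends up looking something like the following for me: serialize to bytes on the Java side, hand the bytes across, and decode in C++. Again just a sketch with made-up names; JNA marshals primitive arrays like byte[] across for the duration of the call, or you can copy once into a com.sun.jna.Memory buffer if the native side needs to keep the data around.

import com.sun.jna.Library;
import com.sun.jna.Memory;
import com.sun.jna.Pointer;

public class BytePassingSketch {

    // Placeholder interface; assumed C exports:
    //   double scoreBytes(const uint8_t *buf, int32_t len);
    //   double scoreBuffer(const uint8_t *buf, int32_t len);
    public interface NativeScorerBytes extends Library {
        double scoreBytes(byte[] buf, int len);    // JNA passes primitive arrays as pointers
        double scoreBuffer(Pointer buf, int len);  // explicit native buffer
    }

    // 'payload' would be e.g. record.toByteArray() from protobuf/thrift.
    public static void example(NativeScorerBytes lib, byte[] payload) {
        // Simplest: let JNA marshal the byte[] across for the duration of the call.
        double viaArray = lib.scoreBytes(payload, payload.length);

        // Or copy once into memory allocated outside the JVM heap, useful if the
        // C++ side holds onto the data after the call returns (but note this is
        // exactly the memory the executor heap limit doesn't see).
        Memory buf = new Memory(payload.length);
        buf.write(0, payload, 0, payload.length);
        double viaBuffer = lib.scoreBuffer(buf, payload.length);
    }
}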
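And the pipe() route for comparison: it keeps the native code in its own process, but every record goes through stdin/stdout as text. The binary path here is made up, and the binary would have to be pre-installed on the workers or shipped separately (e.g. via spark-submit --files).

// Each partition's records are written as lines to the subprocess's stdin,
// and each line it prints to stdout becomes an element of the result RDD.
JavaRDD<String> scored = input.pipe("/opt/native/scorer");

That copy-and-reparse step, plus the way a crashed subprocess surfaces, is what pushed me back toward JNA.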