Hi, I want to write some common utility functions in Scala and call them from the Java and Python Spark APIs (perhaps with some wrapper code around the Scala calls). Calling the Scala functions from Java works fine. Reading the PySpark RDD code, I found that PySpark is able to call JavaRDD methods such as union/zip to implement the same operations for PySpark RDDs, deserialize the output, and everything works fine. But somehow I am not able to get a really simple example of my own working. I think I am missing a serialization/deserialization step.
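
For context, this is the kind of pattern I am referring to. The snippet below is a tiny, self-contained illustration (assuming a live SparkContext sc), mirroring what pyspark/rdd.py does for union when both serializers match:

    from pyspark.rdd import RDD

    a = sc.parallelize([1, 2, 3])
    b = sc.parallelize([4, 5, 6])

    # a._jrdd is a JavaRDD[Array[Byte]] holding pickled batches of Python objects.
    # union never looks inside the elements, so calling it directly on the
    # byte-array RDDs is safe.
    j_union = a._jrdd.union(b._jrdd)

    # Wrap the JVM result back into a Python RDD, reusing a's deserializer so the
    # pickled bytes are decoded correctly on collect().
    c = RDD(j_union, sc, a._jrdd_deserializer)
    print(c.collect())   # [1, 2, 3, 4, 5, 6]
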
Can someone confirm whether this is even possible? Or would it be much easier to pass RDD data files around instead of the RDDs themselves (from PySpark to Java/Scala)?

For example, the code below just adds 1 to each element of an RDD of Ints:

    package flukebox.test

    import org.apache.spark.rdd.RDD

    object TestClass {
      def testFunc(data: RDD[Int]) = {
        data.map(x => x + 1)
      }
    }

Calling it from Python:

    from pyspark import RDD
    from py4j.java_gateway import java_import

    java_import(sc._gateway.jvm, "flukebox.test")

    data = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9])
    sc._jvm.flukebox.test.TestClass.testFunc(data._jrdd.rdd())

*This fails because testFunc receives an RDD of byte arrays (the pickled Python data), not an RDD[Int].*

Any help/pointer would be highly appreciated.

Thanks & Regards,
Jai K Singh
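
P.S. From reading the PySpark internals, I suspect the missing piece is converting the pickled byte-array RDD into a JVM RDD of real objects before the call, and converting the result back afterwards. Below is an untested sketch of what I have in mind; it assumes org.apache.spark.api.python.SerDeUtil (which I believe PySpark itself uses for exactly this conversion) is available in this form in my Spark version, and flukebox.test.TestClass.testFunc is my own code from above:

    from pyspark.rdd import RDD
    from pyspark.serializers import AutoBatchedSerializer, PickleSerializer

    data = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9])
    jvm = sc._jvm

    # data._jrdd is a JavaRDD[Array[Byte]] of pickled batches; unpickle it on the
    # JVM side first to get a JavaRDD of real objects (these arrive as boxed
    # java.lang.Integer/Long, not Scala Int, so the Scala side may still need to
    # account for that).
    java_rdd = jvm.org.apache.spark.api.python.SerDeUtil.pythonToJava(data._jrdd, True)

    # Call my Scala utility on the underlying RDD.
    result_rdd = jvm.flukebox.test.TestClass.testFunc(java_rdd.rdd())

    # Pickle the result back into a byte-array RDD that Python can deserialize.
    result_bytes = jvm.org.apache.spark.api.python.SerDeUtil.javaToPython(
        result_rdd.toJavaRDD())
    result = RDD(result_bytes, sc, AutoBatchedSerializer(PickleSerializer()))
    print(result.collect())

Does this look like the right direction, or is there a cleaner supported way to do it?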