Hi, I want to write some common utility functions in Scala and call them from the Java and Python Spark APIs (perhaps with some wrapper code around the Scala calls). Calling the Scala functions from Java works fine. Reading the PySpark RDD code, I found that PySpark is able to call JavaRDD methods such as union/zip to implement the same operations for PySpark RDDs, deserialize the output, and everything works fine. But somehow I am not able to get a really simple example of my own working. I think I am missing a serialization/deserialization step.
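
For context, this is the kind of pattern I am referring to. The snippet below is a tiny, self-contained illustration (assuming a live SparkContext sc), mirroring what pyspark/rdd.py does for union when both serializers match:

    from pyspark.rdd import RDD

    a = sc.parallelize([1, 2, 3])
    b = sc.parallelize([4, 5, 6])

    # a._jrdd is a JavaRDD[Array[Byte]] holding pickled batches of Python objects.
    # union never looks inside the elements, so calling it directly on the
    # byte-array RDDs is safe.
    j_union = a._jrdd.union(b._jrdd)

    # Wrap the JVM result back into a Python RDD, reusing a's deserializer so the
    # pickled bytes are decoded correctly on collect().
    c = RDD(j_union, sc, a._jrdd_deserializer)
    print(c.collect())   # [1, 2, 3, 4, 5, 6]
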
Can someone confirm whether this is even possible? Or would it be much easier to pass RDD data files around instead of the RDDs themselves (from PySpark to Java/Scala)?

For example, the code below just adds 1 to each element of an RDD of Ints:

    package flukebox.test

    import org.apache.spark.rdd.RDD

    object TestClass {
      def testFunc(data: RDD[Int]) = {
        data.map(x => x + 1)
      }
    }

Calling it from Python:

    from pyspark import RDD
    from py4j.java_gateway import java_import

    java_import(sc._gateway.jvm, "flukebox.test")

    data = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9])
    sc._jvm.flukebox.test.TestClass.testFunc(data._jrdd.rdd())

*This fails because testFunc receives an RDD of byte arrays (the pickled Python data), not an RDD[Int].*

Any help/pointer would be highly appreciated.

Thanks & Regards,
Jai K Singh
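
P.S. From reading the PySpark internals, I suspect the missing piece is converting the pickled byte-array RDD into a JVM RDD of real objects before the call, and converting the result back afterwards. Below is an untested sketch of what I have in mind; it assumes org.apache.spark.api.python.SerDeUtil (which I believe PySpark itself uses for exactly this conversion) is available in this form in my Spark version, and flukebox.test.TestClass.testFunc is my own code from above:

    from pyspark.rdd import RDD
    from pyspark.serializers import AutoBatchedSerializer, PickleSerializer

    data = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9])
    jvm = sc._jvm

    # data._jrdd is a JavaRDD[Array[Byte]] of pickled batches; unpickle it on the
    # JVM side first to get a JavaRDD of real objects (these arrive as boxed
    # java.lang.Integer/Long, not Scala Int, so the Scala side may still need to
    # account for that).
    java_rdd = jvm.org.apache.spark.api.python.SerDeUtil.pythonToJava(data._jrdd, True)

    # Call my Scala utility on the underlying RDD.
    result_rdd = jvm.flukebox.test.TestClass.testFunc(java_rdd.rdd())

    # Pickle the result back into a byte-array RDD that Python can deserialize.
    result_bytes = jvm.org.apache.spark.api.python.SerDeUtil.javaToPython(
        result_rdd.toJavaRDD())
    result = RDD(result_bytes, sc, AutoBatchedSerializer(PickleSerializer()))
    print(result.collect())

Does this look like the right direction, or is there a cleaner supported way to do it?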