Call Scala API from PySpark

Pedro Rodriguez Thu, 30 Jun 2016 09:53:40 -0700

Hi All,

I have written a Scala package which essentially wraps the SparkContext
around a custom class that adds some functionality specific to our internal
use case. I am trying to figure out the best way to call this from PySpark.


I would like to do this similarly to how Spark itself calls the JVM
SparkContext as in:
https://github.com/apache/spark/blob/v1.6.2/python/pyspark/context.py

My goal would be something like this:

Scala Code (this is done):
>>> import com.company.mylibrary.CustomContext
>>> val myContext = CustomContext(sc)
>>> val rdd: RDD[String] = myContext.customTextFile("path")

Python Code (I want to be able to do this):
>>> from company.mylibrary import CustomContext
>>> myContext = CustomContext(sc)
>>> rdd = myContext.customTextFile("path")

At the end of each code, I should be working with an ordinary RDD[String].

I am trying to access my Scala class through sc._jvm as below, but not
having any luck so far.

My attempts:
>>> a = sc._jvm.com.company.mylibrary.CustomContext
>>> dir(a)
['<package or class name>']

Example of what I want::
>>> a = sc._jvm.PythonRDD
>>> dir(a)
['anonfun$6', 'anonfun$8', 'collectAndServe',
'doubleRDDToDoubleRDDFunctions', 'getWorkerBroadcasts', 'hadoopFile',
'hadoopRDD', 'newAPIHadoopFile', 'newAPIHadoopRDD',
'numericRDDToDoubleRDDFunctions', 'rddToAsyncRDDActions',
'rddToOrderedRDDFunctions', 'rddToPairRDDFunctions',
'rddToPairRDDFunctions$default$4', 'rddToSequenceFileRDDFunctions',
'readBroadcastFromFile', 'readRDDFromFile', 'runJob',
'saveAsHadoopDataset', 'saveAsHadoopFile', 'saveAsNewAPIHadoopFile',
'saveAsSequenceFile', 'sequenceFile', 'serveIterator', 'valueOfPair',
'writeIteratorToStream', 'writeUTF']

The next thing I would run into is converting the JVM RDD[String] back to a
Python RDD, what is the easiest way to do this?

Overall, is this a good approach to calling the same API in Scala and
Python?

-- 
Pedro Rodriguez
PhD Student in Distributed Machine Learning | CU Boulder
UC Berkeley AMPLab Alumni

ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
Github: github.com/EntilZha | LinkedIn:
https://www.linkedin.com/in/pedrorodriguezscience

Call Scala API from PySpark

Reply via email to