Hi Pedro,

Your use case is interesting.  I think launching the Java gateway is the same
as for the native SparkContext; the only difference is that you create your
custom SparkContext instead of the native one. You might also need to wrap it
in a Java-friendly API.

https://github.com/apache/spark/blob/v1.6.2/python/pyspark/context.py#L172
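
Something like this might work as a rough sketch (assuming your CustomContext
takes the Scala SparkContext in its constructor and exposes a Java-friendly
method that returns a JavaRDD[String]; the name javaCustomTextFile below is
just a placeholder on my side):

    from pyspark.rdd import RDD
    from pyspark.serializers import UTF8Deserializer

    # reach the Scala class through the running Py4J gateway
    CustomContext = sc._jvm.com.company.mylibrary.CustomContext

    # sc._jsc is the JavaSparkContext; sc._jsc.sc() is the underlying
    # Scala SparkContext that the custom class presumably expects
    my_context = CustomContext(sc._jsc.sc())

    # javaCustomTextFile is a hypothetical Java-friendly wrapper that
    # returns a JavaRDD[String]; wrap it back into a Python RDD the same
    # way context.py does for textFile
    jrdd = my_context.javaCustomTextFile("path")
    rdd = RDD(jrdd, sc, UTF8Deserializer())

That is basically the pattern textFile itself uses in context.py; the
UTF8Deserializer tells PySpark that the JavaRDD holds plain strings rather
than pickled Python objects.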



On Thu, Jun 30, 2016 at 9:53 AM, Pedro Rodriguez <ski.rodrig...@gmail.com>
wrote:

> Hi All,
>
> I have written a Scala package which essentially wraps the SparkContext
> in a custom class that adds some functionality specific to our internal
> use case. I am trying to figure out the best way to call this from PySpark.
>
> I would like to do this similarly to how Spark itself calls the JVM
> SparkContext as in:
> https://github.com/apache/spark/blob/v1.6.2/python/pyspark/context.py
>
> My goal would be something like this:
>
> Scala Code (this is done):
> >>> import com.company.mylibrary.CustomContext
> >>> val myContext = CustomContext(sc)
> >>> val rdd: RDD[String] = myContext.customTextFile("path")
>
> Python Code (I want to be able to do this):
> >>> from company.mylibrary import CustomContext
> >>> myContext = CustomContext(sc)
> >>> rdd = myContext.customTextFile("path")
>
> At the end of each snippet, I should be working with an ordinary RDD[String].
>
> I am trying to access my Scala class through sc._jvm as below, but am not
> having any luck so far.
>
> My attempts:
> >>> a = sc._jvm.com.company.mylibrary.CustomContext
> >>> dir(a)
> ['<package or class name>']
>
> Example of what I want:
> >>> a = sc._jvm.PythonRDD
> >>> dir(a)
> ['anonfun$6', 'anonfun$8', 'collectAndServe',
> 'doubleRDDToDoubleRDDFunctions', 'getWorkerBroadcasts', 'hadoopFile',
> 'hadoopRDD', 'newAPIHadoopFile', 'newAPIHadoopRDD',
> 'numericRDDToDoubleRDDFunctions', 'rddToAsyncRDDActions',
> 'rddToOrderedRDDFunctions', 'rddToPairRDDFunctions',
> 'rddToPairRDDFunctions$default$4', 'rddToSequenceFileRDDFunctions',
> 'readBroadcastFromFile', 'readRDDFromFile', 'runJob',
> 'saveAsHadoopDataset', 'saveAsHadoopFile', 'saveAsNewAPIHadoopFile',
> 'saveAsSequenceFile', 'sequenceFile', 'serveIterator', 'valueOfPair',
> 'writeIteratorToStream', 'writeUTF']
>
> The next thing I would run into is converting the JVM RDD[String] back to
> a Python RDD; what is the easiest way to do this?
>
> Overall, is this a good approach to calling the same API in Scala and
> Python?
>
> --
> Pedro Rodriguez
> PhD Student in Distributed Machine Learning | CU Boulder
> UC Berkeley AMPLab Alumni
>
> ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
> Github: github.com/EntilZha | LinkedIn:
> https://www.linkedin.com/in/pedrorodriguezscience
>
>


-- 
Best Regards

Jeff Zhang
