Thanks Jeff and Holden,

A little more context here probably helps. I am working on implementing the
idea from this article to make reads from S3 faster:
http://tech.kinja.com/how-not-to-pull-from-s3-using-apache-spark-1704509219
(although my name is Pedro, I am not the author of the article). The reason
for wrapping SparkContext is so that the only code change is from sc.textFile
to sc.s3TextFile, in addition to configuring AWS keys correctly (we are
seeing if we can open source our library, but that depends on the company).
Overall, it's a very light wrapper, and perhaps calling it a context is not
quite the right name because of that.
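
For anyone who has not read the article, the core of the pattern looks
roughly like this in Python (just an illustrative sketch, not our actual
code, which is in Scala; s3_text_file and the boto3 calls here are only to
show the idea of parallelizing over the keys):

    import boto3

    def s3_text_file(sc, bucket, prefix):
        # List the keys on the driver, then read each object inside the
        # job so the actual S3 reads happen on the executors.
        s3 = boto3.client("s3")
        resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
        keys = [obj["Key"] for obj in resp.get("Contents", [])]

        def read_key(key):
            # Each task creates its own client; credentials come from the
            # usual AWS configuration on the executors.
            body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"]
            return body.read().decode("utf-8").splitlines()

        return sc.parallelize(keys).flatMap(read_key)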

At the end of the day I make an sc.parallelize call and return an
RDD[String] as described in that blog post. I found a post on the Py4J
mailing list that reminded me that the JVM gateway needs the jars on
spark.driver.extraClassPath and spark.executor.extraClassPath in addition to
the spark.jars option. With that, I can see the classes now. It looks like I
need to do as you suggest and wrap it using Java in order to go the last
mile of calling the method/constructor. I don't know yet how to get the RDD
back to PySpark, though, so any pointers on that would be great.
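
For reference, the submit-side configuration that finally made the classes
visible looks roughly like the comment below, and my (untested) guess at the
RDD conversion mirrors what sc.textFile does in context.py, assuming the
Scala side hands back a JavaRDD[String]:

    # spark-submit \
    #   --conf spark.jars=mylibrary.jar \
    #   --conf spark.driver.extraClassPath=mylibrary.jar \
    #   --conf spark.executor.extraClassPath=mylibrary.jar ...
    # (mylibrary.jar is a placeholder for our actual jar)

    from pyspark.rdd import RDD
    from pyspark.serializers import UTF8Deserializer

    def to_python_rdd(sc, jrdd):
        # jrdd is a Py4J handle to a JavaRDD[String] returned by the Scala
        # wrapper; wrap it the same way sc.textFile wraps its JVM RDD.
        return RDD(jrdd, sc, UTF8Deserializer())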

Thanks for the tip on the code, Holden. I will take a look to see if it can
give me some insight into how to write the Python side.
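
If I understand the _initialize_context suggestion correctly, my rough
(untested) guess is something like the following; the CustomContext
constructor taking a SparkConf and its sparkContext() accessor are
placeholders for whatever we end up exposing from the JVM side:

    from pyspark import SparkContext

    class CustomSparkContext(SparkContext):
        def _initialize_context(self, jconf):
            # Build our JVM-side wrapper first (hypothetical constructor
            # taking a SparkConf), then hand PySpark a JavaSparkContext
            # over the same underlying SparkContext.
            jctx = self._jvm.com.company.mylibrary.CustomContext(jconf)
            return self._jvm.JavaSparkContext(jctx.sparkContext())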

Thanks!
Pedro

On Thu, Jun 30, 2016 at 12:23 PM, Holden Karau <hol...@pigscanfly.ca> wrote:

> So I'm a little biased - I think the best bridge between the two is using
> DataFrames. I've got some examples in my talk and in the high performance
> spark GitHub repo;
> https://github.com/high-performance-spark/high-performance-spark-examples/blob/master/high_performance_pyspark/simple_perf_test.py
> calls some custom Scala code.
>
> Using a custom context is a bit tricky though because of how the launching
> is done; as Jeff Zhang points out, you would need to wrap it in a
> JavaSparkContext and then you could override the _initialize_context
> function in context.py.
>
> On Thu, Jun 30, 2016 at 11:06 AM, Jeff Zhang <zjf...@gmail.com> wrote:
>
>> Hi Pedro,
>>
>> Your use case is interesting. I think launching the Java gateway is the
>> same as for the native SparkContext; the only difference is creating your
>> custom SparkContext instead of the native one. You might also need to wrap
>> it using Java.
>>
>> https://github.com/apache/spark/blob/v1.6.2/python/pyspark/context.py#L172
>>
>>
>>
>> On Thu, Jun 30, 2016 at 9:53 AM, Pedro Rodriguez <ski.rodrig...@gmail.com
>> > wrote:
>>
>>> Hi All,
>>>
>>> I have written a Scala package which essentially wraps the SparkContext
>>> in a custom class that adds some functionality specific to our internal
>>> use case. I am trying to figure out the best way to call this from PySpark.
>>>
>>> I would like to do this similarly to how Spark itself calls the JVM
>>> SparkContext as in:
>>> https://github.com/apache/spark/blob/v1.6.2/python/pyspark/context.py
>>>
>>> My goal would be something like this:
>>>
>>> Scala Code (this is done):
>>> >>> import com.company.mylibrary.CustomContext
>>> >>> val myContext = CustomContext(sc)
>>> >>> val rdd: RDD[String] = myContext.customTextFile("path")
>>>
>>> Python Code (I want to be able to do this):
>>> >>> from company.mylibrary import CustomContext
>>> >>> myContext = CustomContext(sc)
>>> >>> rdd = myContext.customTextFile("path")
>>>
>>> At the end of each snippet, I should be working with an ordinary
>>> RDD[String].
>>>
>>> I am trying to access my Scala class through sc._jvm as below, but am
>>> not having any luck so far.
>>>
>>> My attempts:
>>> >>> a = sc._jvm.com.company.mylibrary.CustomContext
>>> >>> dir(a)
>>> ['<package or class name>']
>>>
>>> Example of what I want:
>>> >>> a = sc._jvm.PythonRDD
>>> >>> dir(a)
>>> ['anonfun$6', 'anonfun$8', 'collectAndServe',
>>> 'doubleRDDToDoubleRDDFunctions', 'getWorkerBroadcasts', 'hadoopFile',
>>> 'hadoopRDD', 'newAPIHadoopFile', 'newAPIHadoopRDD',
>>> 'numericRDDToDoubleRDDFunctions', 'rddToAsyncRDDActions',
>>> 'rddToOrderedRDDFunctions', 'rddToPairRDDFunctions',
>>> 'rddToPairRDDFunctions$default$4', 'rddToSequenceFileRDDFunctions',
>>> 'readBroadcastFromFile', 'readRDDFromFile', 'runJob',
>>> 'saveAsHadoopDataset', 'saveAsHadoopFile', 'saveAsNewAPIHadoopFile',
>>> 'saveAsSequenceFile', 'sequenceFile', 'serveIterator', 'valueOfPair',
>>> 'writeIteratorToStream', 'writeUTF']
>>>
>>> The next thing I would run into is converting the JVM RDD[String] back
>>> to a Python RDD. What is the easiest way to do this?
>>>
>>> Overall, is this a good approach to calling the same API in Scala and
>>> Python?
>>>
>>> --
>>> Pedro Rodriguez
>>> PhD Student in Distributed Machine Learning | CU Boulder
>>> UC Berkeley AMPLab Alumni
>>>
>>> ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
>>> Github: github.com/EntilZha | LinkedIn:
>>> https://www.linkedin.com/in/pedrorodriguezscience
>>>
>>>
>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>
>
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
>



-- 
Pedro Rodriguez
PhD Student in Distributed Machine Learning | CU Boulder
UC Berkeley AMPLab Alumni

ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
Github: github.com/EntilZha | LinkedIn:
https://www.linkedin.com/in/pedrorodriguezscience
