Quick update: I was able to get most of the plumbing to work, thanks to the code Holden posted and to browsing more of the source code.
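(Side note for the archives: the UnpicklingError below is ordinary pickle behavior rather than anything Spark-specific; data written by a Pickler that uses persistent_id cannot be read back by an Unpickler that defines no persistent_load hook. A minimal stdlib reproduction, with an illustrative RefPickler that has no Spark counterpart:)

```python
import io
import pickle

class RefPickler(pickle.Pickler):
    def persistent_id(self, obj):
        # Treat strings as external references, stored out-of-band.
        # Returning None means "pickle normally".
        if isinstance(obj, str):
            return "ref:" + obj
        return None

buf = io.BytesIO()
RefPickler(buf).dump("hello")
buf.seek(0)

try:
    # Plain pickle.load has no persistent_load, so it cannot resolve
    # the persistent-id opcode the RefPickler emitted.
    pickle.load(buf)
except pickle.UnpicklingError as e:
    print(e)
```

The same thing happens when the bytes handed to PySpark's default pickle-based deserializer were not produced by a matching pickler, which is why the choice of RDD serializer matters here.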
I am running into this error, which makes me think that maybe I shouldn't leave the
default Python RDD serializer/pickler in place and should do something else instead
(https://github.com/apache/spark/blob/v1.6.2/python/pyspark/rdd.py#L182):

_pickle.UnpicklingError: A load persistent id instruction was encountered,
but no persistent_load function was specified.

On Thu, Jun 30, 2016 at 2:13 PM, Pedro Rodriguez <ski.rodrig...@gmail.com> wrote:

> Thanks Jeff and Holden,
>
> A little more context here probably helps. I am working on implementing
> the idea from this article to make reads from S3 faster:
> http://tech.kinja.com/how-not-to-pull-from-s3-using-apache-spark-1704509219
> (although my name is Pedro, I am not the author of the article). The reason
> for wrapping SparkContext is so that the code change is from sc.textFile to
> sc.s3TextFile, in addition to configuring AWS keys correctly (we are seeing
> if we can open source our library, but that depends on the company).
> Overall, it's a very light wrapper, and perhaps calling it a context is not
> quite the right name because of that.
>
> At the end of the day I make a sc.parallelize call and return an
> RDD[String] as described in that blog post. I found a post from the Py4J
> mailing list that reminded me that the JVM gateway needs the jars in
> spark.driver/executor.extraClassPath in addition to the spark.jars option.
> With that, I can see the classes now. It looks like I need to do as you
> suggest and wrap it using Java in order to go the last mile to calling the
> method/constructor. I don't know yet how to get the RDD back to PySpark,
> though, so any pointers on that would be great.
>
> Thanks for the tip on the code, Holden, I will take a look to see if it can
> give me some insight on how to write the Python part.
>
> Thanks!
> Pedro
>
> On Thu, Jun 30, 2016 at 12:23 PM, Holden Karau <hol...@pigscanfly.ca> wrote:
>
>> So I'm a little biased - I think the best bridge between the two is using
>> DataFrames.
>> I've got some examples in my talk and on the High Performance Spark GitHub:
>> https://github.com/high-performance-spark/high-performance-spark-examples/blob/master/high_performance_pyspark/simple_perf_test.py
>> calls some custom Scala code.
>>
>> Using a custom context is a bit tricky though because of how the
>> launching is done. As Jeff Zhang points out, you would need to wrap it in a
>> JavaSparkContext, and then you could override the _initialize_context
>> function in context.py.
>>
>> On Thu, Jun 30, 2016 at 11:06 AM, Jeff Zhang <zjf...@gmail.com> wrote:
>>
>>> Hi Pedro,
>>>
>>> Your use case is interesting. I think launching the Java gateway is the
>>> same as for the native SparkContext; the only difference is creating your
>>> custom SparkContext instead of the native one. You might also need to wrap
>>> it using Java.
>>>
>>> https://github.com/apache/spark/blob/v1.6.2/python/pyspark/context.py#L172
>>>
>>> On Thu, Jun 30, 2016 at 9:53 AM, Pedro Rodriguez <ski.rodrig...@gmail.com> wrote:
>>>
>>>> Hi All,
>>>>
>>>> I have written a Scala package which essentially wraps the SparkContext
>>>> in a custom class that adds some functionality specific to our internal
>>>> use case. I am trying to figure out the best way to call this from PySpark.
>>>>
>>>> I would like to do this similarly to how Spark itself calls the JVM
>>>> SparkContext, as in:
>>>> https://github.com/apache/spark/blob/v1.6.2/python/pyspark/context.py
>>>>
>>>> My goal would be something like this:
>>>>
>>>> Scala code (this is done):
>>>> >>> import com.company.mylibrary.CustomContext
>>>> >>> val myContext = CustomContext(sc)
>>>> >>> val rdd: RDD[String] = myContext.customTextFile("path")
>>>>
>>>> Python code (I want to be able to do this):
>>>> >>> from company.mylibrary import CustomContext
>>>> >>> myContext = CustomContext(sc)
>>>> >>> rdd = myContext.customTextFile("path")
>>>>
>>>> At the end of each snippet, I should be working with an ordinary
>>>> RDD[String].
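One possible shape for Holden's _initialize_context suggestion, as a hedged sketch: stock pyspark's _initialize_context simply returns self._jvm.JavaSparkContext(jconf), and the idea is to override it so the JVM-side context is the custom one. The createJavaSparkContext factory name below is made up for illustration, and the pyspark import is deferred so the snippet can be read and checked without Spark installed:

```python
def make_custom_spark_context_class():
    # Deferred import: pyspark is only needed when the class is actually built.
    from pyspark import SparkContext

    class CustomSparkContext(SparkContext):
        def _initialize_context(self, jconf):
            # Stock pyspark does: return self._jvm.JavaSparkContext(jconf)
            # Here we assume a (hypothetical) Scala-side factory that builds
            # the custom context and hands back a JavaSparkContext.
            jvm = self._jvm
            return jvm.com.company.mylibrary.CustomContext.createJavaSparkContext(jconf)

    return CustomSparkContext
```

With something like that in place, constructing the custom class from Python would drive the custom JVM context through the normal launch path instead of bolting it on afterwards.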
>>>>
>>>> I am trying to access my Scala class through sc._jvm as below, but am not
>>>> having any luck so far.
>>>>
>>>> My attempts:
>>>> >>> a = sc._jvm.com.company.mylibrary.CustomContext
>>>> >>> dir(a)
>>>> ['<package or class name>']
>>>>
>>>> Example of what I want:
>>>> >>> a = sc._jvm.PythonRDD
>>>> >>> dir(a)
>>>> ['anonfun$6', 'anonfun$8', 'collectAndServe',
>>>> 'doubleRDDToDoubleRDDFunctions', 'getWorkerBroadcasts', 'hadoopFile',
>>>> 'hadoopRDD', 'newAPIHadoopFile', 'newAPIHadoopRDD',
>>>> 'numericRDDToDoubleRDDFunctions', 'rddToAsyncRDDActions',
>>>> 'rddToOrderedRDDFunctions', 'rddToPairRDDFunctions',
>>>> 'rddToPairRDDFunctions$default$4', 'rddToSequenceFileRDDFunctions',
>>>> 'readBroadcastFromFile', 'readRDDFromFile', 'runJob',
>>>> 'saveAsHadoopDataset', 'saveAsHadoopFile', 'saveAsNewAPIHadoopFile',
>>>> 'saveAsSequenceFile', 'sequenceFile', 'serveIterator', 'valueOfPair',
>>>> 'writeIteratorToStream', 'writeUTF']
>>>>
>>>> The next thing I would run into is converting the JVM RDD[String] back
>>>> to a Python RDD; what is the easiest way to do this?
>>>>
>>>> Overall, is this a good approach to calling the same API from Scala and
>>>> Python?
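On the conversion question: the pattern sc.textFile itself uses in pyspark/context.py is to wrap the JavaRDD[String] coming back from the JVM in a Python RDD built with UTF8Deserializer, since JVM-side strings arrive UTF-8 encoded rather than pickled (which also lines up with the UnpicklingError at the top of this thread). A sketch of that, with the pyspark imports deferred so it can be inspected without Spark installed:

```python
def wrap_jvm_string_rdd(jrdd, sc):
    """Wrap a JVM JavaRDD[String] as a Python RDD of str.

    Mirrors what SparkContext.textFile does in pyspark/context.py:
    strings produced on the JVM side are UTF-8 encoded, not pickled,
    so the Python RDD must be constructed with UTF8Deserializer
    instead of the default pickle-based serializer.
    """
    from pyspark.rdd import RDD
    from pyspark.serializers import UTF8Deserializer
    return RDD(jrdd, sc, UTF8Deserializer())
```

Usage would then be something like rdd = wrap_jvm_string_rdd(jvm_context.customTextFile("path"), sc), after which rdd.collect() yields plain Python strings.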
>>>>
>>>> --
>>>> Pedro Rodriguez
>>>> PhD Student in Distributed Machine Learning | CU Boulder
>>>> UC Berkeley AMPLab Alumni
>>>>
>>>> ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
>>>> Github: github.com/EntilZha | LinkedIn:
>>>> https://www.linkedin.com/in/pedrorodriguezscience
>>>
>>> --
>>> Best Regards
>>>
>>> Jeff Zhang
>>
>> --
>> Cell : 425-233-8271
>> Twitter: https://twitter.com/holdenkarau
>
> --
> Pedro Rodriguez
> PhD Student in Distributed Machine Learning | CU Boulder
> UC Berkeley AMPLab Alumni
>
> ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
> Github: github.com/EntilZha | LinkedIn:
> https://www.linkedin.com/in/pedrorodriguezscience

--
Pedro Rodriguez
PhD Student in Distributed Machine Learning | CU Boulder
UC Berkeley AMPLab Alumni

ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
Github: github.com/EntilZha | LinkedIn:
https://www.linkedin.com/in/pedrorodriguezscience