Hi All, I have written a Scala package which essentially wraps the SparkContext around a custom class that adds some functionality specific to our internal use case. I am trying to figure out the best way to call this from PySpark.
I would like to do this similarly to how Spark itself calls the JVM SparkContext as in: https://github.com/apache/spark/blob/v1.6.2/python/pyspark/context.py My goal would be something like this: Scala Code (this is done): >>> import com.company.mylibrary.CustomContext >>> val myContext = CustomContext(sc) >>> val rdd: RDD[String] = myContext.customTextFile("path") Python Code (I want to be able to do this): >>> from company.mylibrary import CustomContext >>> myContext = CustomContext(sc) >>> rdd = myContext.customTextFile("path") At the end of each code, I should be working with an ordinary RDD[String]. I am trying to access my Scala class through sc._jvm as below, but not having any luck so far. My attempts: >>> a = sc._jvm.com.company.mylibrary.CustomContext >>> dir(a) ['<package or class name>'] Example of what I want:: >>> a = sc._jvm.PythonRDD >>> dir(a) ['anonfun$6', 'anonfun$8', 'collectAndServe', 'doubleRDDToDoubleRDDFunctions', 'getWorkerBroadcasts', 'hadoopFile', 'hadoopRDD', 'newAPIHadoopFile', 'newAPIHadoopRDD', 'numericRDDToDoubleRDDFunctions', 'rddToAsyncRDDActions', 'rddToOrderedRDDFunctions', 'rddToPairRDDFunctions', 'rddToPairRDDFunctions$default$4', 'rddToSequenceFileRDDFunctions', 'readBroadcastFromFile', 'readRDDFromFile', 'runJob', 'saveAsHadoopDataset', 'saveAsHadoopFile', 'saveAsNewAPIHadoopFile', 'saveAsSequenceFile', 'sequenceFile', 'serveIterator', 'valueOfPair', 'writeIteratorToStream', 'writeUTF'] The next thing I would run into is converting the JVM RDD[String] back to a Python RDD, what is the easiest way to do this? Overall, is this a good approach to calling the same API in Scala and Python? -- Pedro Rodriguez PhD Student in Distributed Machine Learning | CU Boulder UC Berkeley AMPLab Alumni ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423 Github: github.com/EntilZha | LinkedIn: https://www.linkedin.com/in/pedrorodriguezscience