I'm going to try what Ewen suggested--the Python wrappers seem pretty straightforward to understand and very readable. In particular, I am interested in SparkContext.hadoopRDD() and RDD.saveAsTextFile() (with compression). 

To elaborate on the first point: I'd like to take XML files in HDFS and easily bring them into an RDD. It looks like the Python SparkContext.textFile() splits a file into lines, whereas with SparkContext.hadoopRDD() I could use XmlInputFormat (please correct me if I've misread the documentation here).
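
To make the difference concrete, here is a minimal sketch in plain Python (not Spark code). It contrasts line-based splitting with the start-tag/end-tag scanning that an XmlInputFormat-style reader is assumed to perform. The helper name split_xml_records, the sample document, and the <page> tag are all made up for illustration.

```python
def split_xml_records(text, start_tag, end_tag):
    """Yield each substring that starts at start_tag and ends with end_tag.

    Hypothetical helper: mimics record-oriented reading, not a Spark API.
    """
    pos = 0
    while True:
        start = text.find(start_tag, pos)
        if start == -1:
            return
        end = text.find(end_tag, start + len(start_tag))
        if end == -1:
            return  # unterminated record: stop rather than emit a fragment
        end += len(end_tag)
        yield text[start:end]
        pos = end


doc = (
    "<pages>\n"
    "<page>\n"
    "  <title>A</title>\n"
    "</page>\n"
    "<page>\n"
    "  <title>B</title>\n"
    "</page>\n"
    "</pages>\n"
)

# Line-based splitting: every line becomes its own element.
by_line = doc.splitlines()

# Record-based splitting: each <page>...</page> block stays in one piece.
by_record = list(split_xml_records(doc, "<page>", "</page>"))
```

With line-based splitting a single <page> element is spread across several RDD elements, so it cannot be parsed on its own; with record-based splitting each element is a complete XML fragment.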

Thanks for the help!

Patrick

On Dec 12, 2013, at 1:24 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:

Yeah, I’m curious which APIs you found missing in Python. I know we have a lot on the Scala side that isn’t in the Python API yet, but I’m not sure how to prioritize it.

If you do want to call Python from Scala, you can also use the RDD.pipe() operation to pass data through an external process. However, it won’t be as nice as writing the whole driver program in Python.
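
As a rough illustration of what piping through an external process involves, the sketch below uses plain Python and the standard subprocess module rather than Spark. It writes the elements of one partition to a child process's stdin, one per line, and treats each line of stdout as an output element. The function name pipe_partition and the uppercase worker script are assumptions made up for this example.

```python
import subprocess
import sys


def pipe_partition(elements, command):
    """Send elements to command's stdin, one per line; return its stdout lines.

    Hypothetical stand-in for what a pipe-style operation does to one partition.
    """
    stdin_data = "".join(str(e) + "\n" for e in elements)
    result = subprocess.run(
        command,
        input=stdin_data,
        capture_output=True,
        text=True,
        check=True,  # raise if the external process exits with an error
    )
    return result.stdout.splitlines()


# A tiny Python "worker" that upper-cases every line it reads from stdin.
worker = [
    sys.executable,
    "-c",
    "import sys\nfor line in sys.stdin:\n    print(line.strip().upper())",
]

out = pipe_partition(["spark", "python"], worker)
```

Everything crosses the process boundary as text lines, which is one reason this is less convenient than writing the whole driver in Python.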

Matei

On Dec 12, 2013, at 9:03 AM, Ewen Cheslack-Postava <echesl...@gmail.com> wrote:

Obviously it depends on what is missing, but if I were you, I'd try monkey patching pyspark with the functionality you need first (along with submitting a pull request, of course). The pyspark code is very readable, and a lot of functionality just builds on top of a few primitives, as in the Scala Spark code. And in many cases you can use the Scala version for reference. For example, compare RDD.distinct() in Scala (https://github.com/apache/incubator-spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L263) and Python (https://github.com/apache/incubator-spark/blob/master/python/pyspark/rdd.py#L175) (the Python version is missing numPartitions, but that looks like a trivial fix in this case).
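
A minimal sketch of the monkey-patching idea, assuming distinct() is built from map and reduceByKey as in the code linked above. To stay runnable without a cluster, it patches a made-up stand-in class (MiniRDD) rather than pyspark's real RDD; with pyspark, the same assignment would presumably target pyspark.rdd.RDD instead.

```python
class MiniRDD:
    """Hypothetical stand-in for pyspark's RDD; keeps its data in a list."""

    def __init__(self, data, num_partitions=1):
        self.data = list(data)
        self.num_partitions = num_partitions

    def map(self, f):
        return MiniRDD((f(x) for x in self.data), self.num_partitions)

    def reduceByKey(self, func, numPartitions=None):
        merged = {}
        for key, value in self.data:
            merged[key] = func(merged[key], value) if key in merged else value
        return MiniRDD(merged.items(), numPartitions or self.num_partitions)

    def collect(self):
        return list(self.data)


def distinct(self, numPartitions=None):
    """Deduplicate via map -> reduceByKey -> map, forwarding numPartitions."""
    return (
        self.map(lambda x: (x, None))
        .reduceByKey(lambda x, _: x, numPartitions)
        .map(lambda kv: kv[0])
    )


# The monkey patch: attach the new method to the existing class at runtime.
MiniRDD.distinct = distinct

rdd = MiniRDD([1, 2, 2, 3, 1])
result = rdd.distinct(numPartitions=4)
```

The only change from a version without numPartitions is passing the argument through to reduceByKey, which is why the fix looks trivial.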

-Ewen

December 11, 2013 8:57 PM
Hi all,

I've been mostly using Spark with Python, and it's been a great time (thanks for the earlier help with GPUs, btw), but I recently stumbled through the Scala API and found it incredibly rich, with some options that would be pretty helpful for us but are lacking in the Python API. Is it straightforward to write a driver in Scala, but have the workers be written in Python? Alternatively, can I (easily) use Py4J to access these Scala methods from Python? I imagine I'll be playing around with it over the next few days, but I was wondering if anyone had tried this. Sorry if it's a stupid question...

Thanks for the time and attention

Patrick Grinaway

