Yeah, I’m curious which APIs you found missing in Python. I know we have a lot on the Scala side that aren’t yet in there, but I’m not sure how to prioritize them.

If you do want to call Python from Scala, you can also use the RDD.pipe() operation to pass data through an external process. However, it won't be as nice as writing the whole driver program in Python.
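For reference, RDD.pipe() streams each partition's elements as newline-separated text to the external command's stdin and turns the lines it prints back into the new elements. A rough, pure-Python sketch of that per-partition behavior (the function name `pipe_partition` and the toy worker command are just for illustration, not Spark internals):

```python
import subprocess
import sys

def pipe_partition(lines, command):
    """Roughly what RDD.pipe() does for one partition: feed each
    element to the external process as a line of text, and read the
    lines it prints back as the resulting elements."""
    proc = subprocess.run(
        command,
        input="\n".join(lines) + "\n",
        capture_output=True,
        text=True,
        check=True,
    )
    return proc.stdout.splitlines()

# A stand-in "worker" script that upper-cases its input lines.
worker = [sys.executable, "-c",
          "import sys\nfor line in sys.stdin: print(line.strip().upper())"]

print(pipe_partition(["spark", "pyspark"], worker))  # ['SPARK', 'PYSPARK']
```

In real Spark the command runs once per partition on the workers, so it needs to be installed on every node.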

Matei

On Dec 12, 2013, at 9:03 AM, Ewen Cheslack-Postava <echesl...@gmail.com> wrote:

Obviously it depends on what is missing, but if I were you, I'd try monkey-patching pyspark with the functionality you need first (along with submitting a pull request, of course). The pyspark code is very readable, and a lot of functionality just builds on top of a few primitives, as in the Scala Spark code, so in many cases you can use the Scala version for reference. For example, compare RDD.distinct() in Scala (https://github.com/apache/incubator-spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L263) with the Python version (https://github.com/apache/incubator-spark/blob/master/python/pyspark/rdd.py#L175); the Python version is missing numPartitions, but that looks like a trivial fix in this case.
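Concretely, the monkey-patching pattern looks like this. In real code you would attach the method to pyspark.rdd.RDD; a tiny stand-in class is used here so the sketch runs without a Spark installation, and the `keys` method is just a hypothetical example of a missing API built from an existing primitive:

```python
class RDD:
    """Stand-in for pyspark.rdd.RDD, just enough to show the pattern."""

    def __init__(self, data):
        self.data = data

    def map(self, f):
        return RDD([f(x) for x in self.data])

    def collect(self):
        return list(self.data)

def keys(self):
    """Hypothetical 'missing' method, built on the existing map primitive."""
    return self.map(lambda pair: pair[0])

# The monkey patch: every RDD instance now has .keys().
RDD.keys = keys

rdd = RDD([("a", 1), ("b", 2)])
print(rdd.keys().collect())  # ['a', 'b']
```

Since the patch only composes public methods, it keeps working across Spark versions until the real method lands upstream.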

-Ewen

December 11, 2013 8:57 PM
Hi all,

I've been mostly using Spark with Python, and it's been a great time (thanks for the earlier help with GPUs, btw), but I recently stumbled across the Scala API and found it incredibly rich, with some options that would be pretty helpful for us but are lacking in the Python API. Is it straightforward to write the driver in Scala but have the workers written in Python? Alternatively, can I (easily) use Py4J to access these Scala methods from Python? I imagine I'll be playing around with it over the next few days, but I was wondering if anyone had tried this. Sorry if it's a stupid question...

Thanks for the time and attention

Patrick Grinaway
