Obviously it depends on what is missing, but if I were you, I'd try
monkey-patching pyspark with the functionality you need first (along
with submitting a pull request, of course). The pyspark code is very
readable, and a lot of the functionality just builds on top of a few
primitives, much as in the Scala Spark code. In many cases you can use
the Scala version as a reference. For example, compare RDD.distinct()
in Scala
(https://github.com/apache/incubator-spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L263)
and Python
(https://github.com/apache/incubator-spark/blob/master/python/pyspark/rdd.py#L175)
(the Python version is missing numPartitions, but that looks like a
trivial fix in this case). -Ewen
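
Monkey patching here just means assigning a new function onto the RDD
class at runtime. Below is a minimal sketch of the pattern; the FakeRDD
class is a hypothetical stand-in for pyspark.rdd.RDD (so the example runs
without a SparkContext), and the patched distinct() mirrors the Scala
version by threading numPartitions through reduceByKey:

```python
class FakeRDD:
    """Stand-in for pyspark.rdd.RDD: lists instead of partitioned data."""

    def __init__(self, data):
        self.data = list(data)

    def map(self, f):
        return FakeRDD(f(x) for x in self.data)

    def reduceByKey(self, f, numPartitions=None):
        # pyspark's reduceByKey takes an optional partition count; this
        # stand-in just records it and merges values per key.
        merged = {}
        for k, v in self.data:
            merged[k] = f(merged[k], v) if k in merged else v
        result = FakeRDD(merged.items())
        result.numPartitions = numPartitions
        return result

    def collect(self):
        return self.data


def distinct(self, numPartitions=None):
    # Same shape as the Scala RDD.distinct(numPartitions): map each
    # element to a (x, None) pair, reduce per key (passing the partition
    # count through), then drop the dummy values.
    return (self.map(lambda x: (x, None))
                .reduceByKey(lambda x, _: x, numPartitions)
                .map(lambda kv: kv[0]))


# The monkey patch: assign the new method onto the class at runtime.
# Against real pyspark this would be: pyspark.rdd.RDD.distinct = distinct
FakeRDD.distinct = distinct

rdd = FakeRDD([1, 2, 2, 3, 3, 3])
print(sorted(rdd.distinct(numPartitions=4).collect()))  # [1, 2, 3]
```

The same assignment works on the real class, since Python methods are
just class attributes; the submitted patch would of course edit
rdd.py directly instead.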