Re: Scala driver, Python workers?

2013-12-12 Thread Ewen Cheslack-Postava
Obviously it depends on what is missing, but if I were you, I'd try monkey
patching pyspark with the functionality you need first (along with submitting
a pull request, of course). The pyspark code is very readable, and a lot of
functionality just builds on top of a few primitives, as in the Scala Spark
code. And in many cases you can use the Scala version for reference. For
example, compare RDD.distinct() in Scala
(https://github.com/apache/incubator-spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L263)
and Python
(https://github.com/apache/incubator-spark/blob/master/python/pyspark/rdd.py#L175)
(the Python version is missing numPartitions, but that looks like a trivial
fix in this case).
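
As a rough illustration of the monkey-patching route, the sketch below adds a
numPartitions argument to distinct(). It assumes the usual map / reduceByKey /
map formulation of distinct. FakeRDD is a hypothetical in-memory stand-in (not
part of pyspark) so the snippet runs without a cluster, and
distinct_with_partitions is an illustrative name, not an existing API.

```python
class FakeRDD(object):
    """Hypothetical in-memory stand-in for pyspark.rdd.RDD (not part of Spark)."""

    def __init__(self, items):
        self.items = list(items)

    def map(self, f):
        return FakeRDD(f(x) for x in self.items)

    def reduceByKey(self, func, numPartitions=None):
        # The stand-in ignores numPartitions; a real RDD would use it to
        # decide how many partitions the shuffled result has.
        merged = {}
        for key, value in self.items:
            merged[key] = func(merged[key], value) if key in merged else value
        return FakeRDD(merged.items())

    def collect(self):
        return list(self.items)


def distinct_with_partitions(self, numPartitions=None):
    """Drop duplicates, passing numPartitions through to reduceByKey."""
    return (self.map(lambda x: (x, None))
                .reduceByKey(lambda x, _: x, numPartitions)
                .map(lambda kv: kv[0]))


# The monkey patch itself: attach the function to the class as a method.
FakeRDD.distinct = distinct_with_partitions

print(sorted(FakeRDD([1, 1, 2, 3, 3]).distinct(2).collect()))
```

Against a real installation the same function could presumably be attached
with "from pyspark.rdd import RDD; RDD.distinct = distinct_with_partitions",
since the Python reduceByKey() accepts an optional numPartitions argument.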

-Ewen


   	   
On December 11, 2013, at 8:57 PM, Patrick Grinaway wrote:

Hi all,

I've been mostly using Spark with Python, and it's been a great time (thanks
for the earlier help with GPUs, btw), but I recently stumbled through the
Scala API and found it incredibly rich, with some options that would be
pretty helpful for us but are lacking in the Python API. Is it
straightforward to write a driver in Scala, but have the workers be written
in Python? Alternatively, can I (easily) use Py4J to access these Scala
methods from Python? I imagine I'll be playing around with it over the next
few days, but I was wondering if anyone had tried this. Sorry if it's a
stupid question...

Thanks for the time and attention,

Patrick Grinaway

  




Re: Scala driver, Python workers?

2013-12-12 Thread Matei Zaharia
Yeah, I’m curious which APIs you found missing in Python. I know we have a lot on the Scala side that aren’t yet in there, but I’m not sure how to prioritize them.

If you do want to call Python from Scala, you can also use the RDD.pipe() operation to pass data through an external process. However, it won’t be as nice as writing the whole driver program in Python.

Matei
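
For the RDD.pipe() route, the external process receives each element of a
partition as one line on stdin, and each line it prints becomes an element of
the resulting RDD. A minimal sketch of such a Python worker follows; transform
and run_worker are hypothetical names, and the upper-casing is only
placeholder logic.

```python
import io
import sys


def transform(record):
    """Hypothetical per-record logic; replace with the real computation."""
    return record.upper()


def run_worker(instream, outstream):
    """Read one record per line, write one transformed record per line."""
    for line in instream:
        outstream.write(transform(line.rstrip("\n")) + "\n")


# Local dry run with in-memory streams instead of a real pipe.
demo_out = io.StringIO()
run_worker(io.StringIO("spark\npython\n"), demo_out)
print(demo_out.getvalue(), end="")
```

Saved as a script that ends with run_worker(sys.stdin, sys.stdout), this
could be invoked from a Scala driver with something like
rdd.pipe("python worker.py"), assuming the script and a Python interpreter
are available on every worker node.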

  




Re: Scala driver, Python workers?

2013-12-12 Thread Patrick Grinaway
I'm going to try what Ewen suggested--the Python wrappers seem pretty straightforward to understand and very readable. In particular, I am interested in SparkContext.hadoopRDD() and RDD.saveAsTextFile() (with compression).

To elaborate on the first count, I'd like to be able to take XML files in HDFS and easily bring them into an RDD, but it looks like the Python SparkContext.textFile() will split up a file by lines, whereas with SparkContext.hadoopRDD() I can use XmlInputFormat (please correct me if I've misread the documentation here).

Thanks for the help!

Patrick
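
To make the line-splitting concern concrete, here is a small pure-Python
sketch (no Spark involved) contrasting line-oriented records with records
delimited by a start and end tag, which is roughly how an
XmlInputFormat-style reader is assumed to behave. line_records, tag_records
and the sample document are all hypothetical, for illustration only.

```python
def line_records(text):
    """Mimic line-oriented reading: every line becomes its own record."""
    return text.splitlines()


def tag_records(text, start_tag, end_tag):
    """Return each span from start_tag through end_tag as one record."""
    records = []
    pos = 0
    while True:
        start = text.find(start_tag, pos)
        if start == -1:
            break
        end = text.find(end_tag, start)
        if end == -1:
            break  # unterminated element: stop rather than emit a fragment
        end += len(end_tag)
        records.append(text[start:end])
        pos = end
    return records


doc = (
    "<docs>\n"
    "<page>\n  <title>A</title>\n</page>\n"
    "<page>\n  <title>B</title>\n</page>\n"
    "</docs>\n"
)

# Line-based reading scatters each <page> element over several records.
print(len(line_records(doc)))
# Tag-based reading keeps each <page> element whole.
print(len(tag_records(doc, "<page>", "</page>")))
```

With line-based records, no single element of the RDD holds a complete
<page>, so per-record XML parsing cannot work; tag-delimited records avoid
that problem.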