I'm going to try what Ewen suggested--the Python wrappers seem pretty straightforward to understand and very readable. In particular, I am interested in SparkContext.hadoopRDD() and RDD.saveAsTextFile() (with compression).

To elaborate on the first count, I'd like to be able to take XML files in HDFS and easily bring them into an RDD, but it looks like the Python SparkContext.textFile() will split up a file by lines, whereas with SparkContext.hadoopRDD() I can use XmlInputFormat (please correct me if I've misread the documentation here).

Thanks for the help!

Patrick

On Dec 12, 2013, at 1:24 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
- Scala driver, Python workers? Patrick Grinaway
- Re: Scala driver, Python workers? Ewen Cheslack-Postava
- Re: Scala driver, Python workers? Matei Zaharia
- Re: Scala driver, Python workers? Patrick Grinaway
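
For anyone following the thread, here is a rough pure-Python sketch of the distinction Patrick raises: a line-oriented reader yields one record per line, while a tag-delimited reader (the general idea behind an XML input format) yields one record per element. It does not use Spark or Hadoop at all. The helper names `split_by_lines` and `split_by_tags`, the `<doc>` tags, and the sample text are all made up for illustration, and the real XmlInputFormat may behave differently, for example around split boundaries.

```python
def split_by_lines(text):
    """Mimic a line-oriented reader: one record per line."""
    return text.splitlines()


def split_by_tags(text, start_tag, end_tag):
    """Mimic a tag-delimited reader: one record per start_tag...end_tag span."""
    records = []
    pos = 0
    while True:
        start = text.find(start_tag, pos)
        if start == -1:
            break  # no further records
        end = text.find(end_tag, start + len(start_tag))
        if end == -1:
            break  # unterminated record: this sketch simply drops it
        end += len(end_tag)
        records.append(text[start:end])
        pos = end
    return records


# Hypothetical sample input: two <doc> elements spread over several lines.
sample = (
    "<docs>\n"
    "<doc>\n"
    "  <id>1</id>\n"
    "</doc>\n"
    "<doc>\n"
    "  <id>2</id>\n"
    "</doc>\n"
    "</docs>\n"
)

line_records = split_by_lines(sample)
xml_records = split_by_tags(sample, "<doc>", "</doc>")

print(len(line_records), "line records")
print(len(xml_records), "tag-delimited records")
```

With line splitting, each `<doc>` element is scattered across several records, so downstream code would have to stitch them back together. With tag-delimited splitting, each record is a complete element that can be handed to an XML parser as-is.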