Re: Iterator over RDD in PySpark

2014-08-02 Thread Andrei
Excellent, thank you!

On Sat, Aug 2, 2014 at 4:46 AM, Aaron Davidson ilike...@gmail.com wrote:
> Ah, that's unfortunate, that definitely should be added. Using a
> pyspark-internal method, you could try something like
> javaIterator = rdd._jrdd.toLocalIterator() …

Iterator over RDD in PySpark

2014-08-01 Thread Andrei
Is there a way to get an iterator from an RDD? Something like rdd.collect(), but returning a lazy sequence rather than a single array. Context: I need to GZip processed data to upload it to Amazon S3. Since the archive should be a single file, I want to iterate over the RDD, writing each line to a local .gz file. …
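For illustration, a minimal sketch of the goal being described, assuming a hypothetical lazy line source called lines() standing in for whatever per-line iteration the RDD can provide (Python 3 text-mode gzip; 'output.gz' is an example path):

    import gzip

    # Stream each processed line into a single local gzip archive,
    # ready to be uploaded to Amazon S3 afterwards.
    with gzip.open('output.gz', 'wt') as f:
        for line in lines():  # hypothetical lazy line source
            f.write(line + '\n')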

Re: Iterator over RDD in PySpark

2014-08-01 Thread Aaron Davidson
rdd.toLocalIterator will do almost what you want, but it requires that each individual partition fit in memory (rather than each individual line). Hopefully that's sufficient, though.

On Fri, Aug 1, 2014 at 1:38 AM, Andrei faithlessfri...@gmail.com wrote:
> Is there a way to get an iterator from an RDD? …
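For reference, a sketch of the call (at the time of this thread it was a Scala/Java-only method, as the next message points out; later Spark releases expose toLocalIterator on PySpark RDDs directly; process() is a hypothetical per-line handler):

    # Partitions are pulled to the driver one at a time, so only a
    # single partition must fit in driver memory, not the whole RDD.
    for line in rdd.toLocalIterator():
        process(line)  # hypothetical per-line handler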

Re: Iterator over RDD in PySpark

2014-08-01 Thread Andrei
Thanks, Aaron, it should be fine with partitions (I can repartition it anyway, right?). But rdd.toLocalIterator is purely a Java/Scala method. Is there a Python interface to it? I can get a Java iterator through rdd._jrdd, but it isn't converted to a Python iterator automatically. E.g.: rdd = …
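A quick sketch of the repartitioning idea mentioned above (the count of 200 is an arbitrary example value):

    # More partitions means less data per partition; only one partition
    # at a time has to fit in driver memory when iterating locally.
    smaller = rdd.repartition(200)  # arbitrary example partition count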

Re: Iterator over RDD in PySpark

2014-08-01 Thread Aaron Davidson
Ah, that's unfortunate, that definitely should be added. Using a pyspark-internal method, you could try something like

    javaIterator = rdd._jrdd.toLocalIterator()
    it = rdd._collect_iterator_through_file(javaIterator)

On Fri, Aug 1, 2014 at 3:04 PM, Andrei faithlessfri...@gmail.com wrote: …
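Putting the pieces together, a sketch of the full workaround under these assumptions: _jrdd and _collect_iterator_through_file are pyspark internals of the Spark 1.x era and may change without notice, 'output.gz' is an example path, and the RDD's elements are assumed to be plain strings (Python 2-style binary gzip):

    import gzip

    # Fetch the Java-side local iterator, then wrap it into a Python
    # iterator using a pyspark-internal helper (Spark 1.x internals).
    java_iter = rdd._jrdd.toLocalIterator()
    it = rdd._collect_iterator_through_file(java_iter)

    # Stream the lines into a single local gzip archive for S3 upload.
    with gzip.open('output.gz', 'wb') as f:
        for line in it:
            f.write(line + '\n')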