Re: Iterator over RDD in PySpark

2014-08-02 Thread Andrei
Excellent, thank you!

On Sat, Aug 2, 2014 at 4:46 AM, Aaron Davidson ilike...@gmail.com wrote:
> Ah, that's unfortunate, that definitely should be added. Using a
> pyspark-internal method, you could try something like
> javaIterator = rdd._jrdd.toLocalIterator() …

Iterator over RDD in PySpark

2014-08-01 Thread Andrei
Is there a way to get an iterator from an RDD? Something like rdd.collect(), but returning a lazy sequence rather than a single array. Context: I need to GZip processed data to upload it to Amazon S3. Since the archive should be a single file, I want to iterate over the RDD, writing each line to a local .gz file. …
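For illustration, a minimal sketch of the goal being described, assuming a hypothetical lazy line source called lines() standing in for whatever per-line iteration the RDD can provide (Python 3 text-mode gzip; 'output.gz' is an example path):

    import gzip

    # Stream each processed line into a single local gzip archive,
    # ready to be uploaded to Amazon S3 afterwards.
    with gzip.open('output.gz', 'wt') as f:
        for line in lines():  # hypothetical lazy line source
            f.write(line + '\n')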

Re: Iterator over RDD in PySpark

2014-08-01 Thread Aaron Davidson
rdd.toLocalIterator will do almost what you want, but it requires that each individual partition fit in memory (rather than each individual line). Hopefully that's sufficient, though.

On Fri, Aug 1, 2014 at 1:38 AM, Andrei faithlessfri...@gmail.com wrote:
> Is there a way to get an iterator from an RDD? …
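For reference, a sketch of the call (at the time of this thread it was a Scala/Java-only method, as the next message points out; later Spark releases expose toLocalIterator on PySpark RDDs directly; process() is a hypothetical per-line handler):

    # Partitions are pulled to the driver one at a time, so only a
    # single partition must fit in driver memory, not the whole RDD.
    for line in rdd.toLocalIterator():
        process(line)  # hypothetical per-line handler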

Re: Iterator over RDD in PySpark

2014-08-01 Thread Andrei
Thanks, Aaron, it should be fine with partitions (I can repartition it anyway, right?). But rdd.toLocalIterator is purely a Java/Scala method. Is there a Python interface to it? I can get a Java iterator through rdd._jrdd, but it isn't converted to a Python iterator automatically. E.g.: rdd = …
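A quick sketch of the repartitioning idea mentioned above (the count of 200 is an arbitrary example value):

    # More partitions means less data per partition; only one partition
    # at a time has to fit in driver memory when iterating locally.
    smaller = rdd.repartition(200)  # arbitrary example partition count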

Re: Iterator over RDD in PySpark

2014-08-01 Thread Aaron Davidson
Ah, that's unfortunate, that definitely should be added. Using a pyspark-internal method, you could try something like

    javaIterator = rdd._jrdd.toLocalIterator()
    it = rdd._collect_iterator_through_file(javaIterator)

On Fri, Aug 1, 2014 at 3:04 PM, Andrei faithlessfri...@gmail.com wrote: …
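Putting the pieces together, a sketch of the full workaround under these assumptions: _jrdd and _collect_iterator_through_file are pyspark internals of the Spark 1.x era and may change without notice, 'output.gz' is an example path, and the RDD's elements are assumed to be plain strings (Python 2-style binary gzip):

    import gzip

    # Fetch the Java-side local iterator, then wrap it into a Python
    # iterator using a pyspark-internal helper (Spark 1.x internals).
    java_iter = rdd._jrdd.toLocalIterator()
    it = rdd._collect_iterator_through_file(java_iter)

    # Stream the lines into a single local gzip archive for S3 upload.
    with gzip.open('output.gz', 'wb') as f:
        for line in it:
            f.write(line + '\n')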