Re: Iterator over RDD in PySpark

2014-08-02 Thread Andrei

Excellent, thank you!

On Sat, Aug 2, 2014 at 4:46 AM, Aaron Davidson ilike...@gmail.com wrote:
Ah, that's unfortunate, that definitely should be added. Using a pyspark-internal method, you could try something like

    javaIterator = rdd._jrdd.toLocalIterator()
    it =

Re: Iterator over RDD in PySpark

2014-08-01 Thread Aaron Davidson

rdd.toLocalIterator will do almost what you want, but it requires that each individual partition fit in memory (rather than each individual line). Hopefully that's sufficient, though.

On Fri, Aug 1, 2014 at 1:38 AM, Andrei faithlessfri...@gmail.com wrote:
Is there a way to get an iterator from an RDD?
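To illustrate the memory trade-off described above, here is a minimal sketch in plain Python (not Spark). It assumes the partitions are available as an iterable of iterables, and the name local_iterator is hypothetical. It only mimics the behaviour described in the message: one whole partition is materialized at a time, and its elements are then handed out one by one.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def local_iterator(partitions: Iterable[Iterable[T]]) -> Iterator[T]:
    """Yield elements one by one, holding a single partition in memory at a time.

    Hypothetical helper: a plain-Python stand-in for the partition-at-a-time
    behaviour described in the thread, not part of any Spark API.
    """
    for partition in partitions:
        # Stand-in for fetching one partition to the driver: the whole
        # partition is materialized, so it must fit in memory.
        current: List[T] = list(partition)
        for element in current:
            yield element


# Made-up data: three "partitions" of different sizes.
partitions = [[1, 2, 3], [4, 5], [6]]
result = list(local_iterator(partitions))
```

Under this model, smaller partitions mean a smaller peak memory footprint on the consuming side, which is why repartitioning can help when a single partition is too large.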

Re: Iterator over RDD in PySpark

2014-08-01 Thread Andrei

Thanks, Aaron, it should be fine with partitions (I can repartition it anyway, right?). But rdd.toLocalIterator is a purely Java/Scala method. Is there a Python interface to it? I can get a Java iterator through rdd._jrdd, but it isn't converted to a Python iterator automatically. E.g.: rdd =

Re: Iterator over RDD in PySpark

2014-08-01 Thread Aaron Davidson

Ah, that's unfortunate, that definitely should be added. Using a pyspark-internal method, you could try something like

    javaIterator = rdd._jrdd.toLocalIterator()
    it = rdd._collect_iterator_through_file(javaIterator)

On Fri, Aug 1, 2014 at 3:04 PM, Andrei faithlessfri...@gmail.com wrote: