Excellent, thank you!
On Sat, Aug 2, 2014 at 4:46 AM, Aaron Davidson ilike...@gmail.com wrote:
Ah, that's unfortunate, that definitely should be added. Using a
pyspark-internal method, you could try something like
javaIterator = rdd._jrdd.toLocalIterator()
it = rdd._collect_iterator_through_file(javaIterator)
Is there a way to get an iterator from an RDD? Something like rdd.collect(),
but returning a lazy sequence instead of a single array.
Context: I need to GZip the processed data to upload it to Amazon S3. Since
the archive should be a single file, I want to iterate over the RDD, writing
each line to a local .gz file.
rdd.toLocalIterator will do almost what you want, but requires that each
individual partition fits in memory (rather than each individual line).
Hopefully that's sufficient, though.
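For reference, a minimal sketch of the pattern described above, assuming a
PySpark release in which RDD.toLocalIterator() is exposed in the Python API
(as the follow-up below notes, it was not at the time of this thread); the
file name output.gz is a placeholder:

import gzip

# Stream the RDD to the driver one partition at a time; only a single
# partition needs to fit in driver memory at any moment.
with gzip.open("output.gz", "wt") as f:  # "wt" text mode requires Python 3
    for line in rdd.toLocalIterator():
        f.write(line + "\n")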
On Fri, Aug 1, 2014 at 1:38 AM, Andrei faithlessfri...@gmail.com wrote:
Is there a way to get an iterator from an RDD?
Thanks, Aaron, it should be fine with partitions (I can repartition it
anyway, right?).
But rdd.toLocalIterator is a purely Java/Scala method. Is there a Python
interface to it?
I can get a Java iterator through rdd._jrdd, but it isn't converted to a
Python iterator automatically. E.g.:
rdd =
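The snippet above is cut off in the archive. A purely hypothetical
reconstruction of the kind of attempt being described (the sample data is
made up) might be:

# Hypothetical reconstruction; the original snippet is truncated.
rdd = sc.parallelize(["a", "b", "c"])
java_iterator = rdd._jrdd.toLocalIterator()
# java_iterator is a Py4J handle to a Java iterator; its elements are the
# serialized values, so plain Python iteration over it does not yield
# usable Python objects.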
Ah, that's unfortunate, that definitely should be added. Using a
pyspark-internal method, you could try something like
javaIterator = rdd._jrdd.toLocalIterator()
it = rdd._collect_iterator_through_file(javaIterator)
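Combining this workaround with the gzip use case from the start of the
thread, a sketch might look like the following (both methods are
pyspark-internal, so this may break across versions; output.gz is a
placeholder):

import gzip

# Pyspark-internal workaround: fetch the Java-side local iterator, then
# let PySpark deserialize its elements into Python values.
java_iterator = rdd._jrdd.toLocalIterator()
it = rdd._collect_iterator_through_file(java_iterator)

# Write each element to a local gzip archive, one line at a time.
with gzip.open("output.gz", "wt") as f:  # "wt" text mode requires Python 3
    for line in it:
        f.write(line + "\n")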
On Fri, Aug 1, 2014 at 3:04 PM, Andrei faithlessfri...@gmail.com wrote: