Thanks, Aaron, it should be fine with partitions (I can repartition it anyway, right?). But rdd.toLocalIterator is purely Java/Scala method. Is there Python interface to it? I can get Java iterator though rdd._jrdd, but it isn't converted to Python iterator automatically. E.g.:
>>> rdd = sc.parallelize([1, 2, 3, 4, 5]) >>> it = rdd._jrdd.toLocalIterator() >>> next(it) 14/08/02 01:02:32 INFO SparkContext: Starting job: apply at Iterator.scala:371 ... 14/08/02 01:02:32 INFO SparkContext: Job finished: apply at Iterator.scala:371, took 0.02064317 s bytearray(b'\x80\x02K\x01.') I understand that returned byte array somehow corresponds to actual data, but how can I get it? On Fri, Aug 1, 2014 at 8:49 PM, Aaron Davidson <ilike...@gmail.com> wrote: > rdd.toLocalIterator will do almost what you want, but requires that each > individual partition fits in memory (rather than each individual line). > Hopefully that's sufficient, though. > > > On Fri, Aug 1, 2014 at 1:38 AM, Andrei <faithlessfri...@gmail.com> wrote: > >> Is there a way to get iterator from RDD? Something like rdd.collect(), >> but returning lazy sequence and not single array. >> >> Context: I need to GZip processed data to upload it to Amazon S3. Since >> archive should be a single file, I want to iterate over RDD, writing each >> line to a local .gz file. File is small enough to fit local disk, but still >> large enough not to fit into memory. >> > >