Ah, that's unfortunate; that definitely should be added. Using a pyspark-internal method, you could try something like:
javaIterator = rdd._jrdd.toLocalIterator()
it = rdd._collect_iterator_through_file(javaIterator)

On Fri, Aug 1, 2014 at 3:04 PM, Andrei <faithlessfri...@gmail.com> wrote:

> Thanks, Aaron, it should be fine with partitions (I can repartition it
> anyway, right?).
> But rdd.toLocalIterator is a purely Java/Scala method. Is there a Python
> interface to it?
> I can get a Java iterator through rdd._jrdd, but it isn't converted to a
> Python iterator automatically. E.g.:
>
> >>> rdd = sc.parallelize([1, 2, 3, 4, 5])
> >>> it = rdd._jrdd.toLocalIterator()
> >>> next(it)
> 14/08/02 01:02:32 INFO SparkContext: Starting job: apply at Iterator.scala:371
> ...
> 14/08/02 01:02:32 INFO SparkContext: Job finished: apply at Iterator.scala:371, took 0.02064317 s
> bytearray(b'\x80\x02K\x01.')
>
> I understand that the returned byte array somehow corresponds to the actual
> data, but how can I get it?
>
>
> On Fri, Aug 1, 2014 at 8:49 PM, Aaron Davidson <ilike...@gmail.com> wrote:
>
>> rdd.toLocalIterator will do almost what you want, but requires that each
>> individual partition fits in memory (rather than each individual line).
>> Hopefully that's sufficient, though.
>>
>>
>> On Fri, Aug 1, 2014 at 1:38 AM, Andrei <faithlessfri...@gmail.com> wrote:
>>
>>> Is there a way to get an iterator from an RDD? Something like
>>> rdd.collect(), but returning a lazy sequence rather than a single array.
>>>
>>> Context: I need to GZip processed data to upload it to Amazon S3. Since
>>> the archive should be a single file, I want to iterate over the RDD,
>>> writing each line to a local .gz file. The file is small enough to fit on
>>> local disk, but still too large to fit into memory.
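For the original use case (streaming the RDD into a single local .gz file without collecting it), here is a minimal sketch of how the internal-iterator workaround above might be wired up. The input and output paths are placeholders, and _collect_iterator_through_file is a private PySpark helper whose name and behavior depend on the Spark version, so treat this as an illustration rather than a stable API:

    import gzip

    from pyspark import SparkContext

    sc = SparkContext(appName="rdd-to-gzip")         # hypothetical app name
    rdd = sc.textFile("hdfs:///path/to/processed")   # hypothetical input path

    # Ask the JVM for a partition-by-partition iterator, then wrap it with the
    # internal helper so serialized records come back as Python objects.
    javaIterator = rdd._jrdd.toLocalIterator()
    it = rdd._collect_iterator_through_file(javaIterator)

    # Stream each record into a single local .gz file; only one partition needs
    # to be resident on the driver at a time, never the whole RDD.
    with gzip.open("/tmp/output.gz", "wb") as out:
        for line in it:
            out.write((line + "\n").encode("utf-8"))

Note that the caveat Aaron mentioned still applies: each individual partition must fit in the driver's memory, so repartitioning first may be necessary.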