Hi Spark users,

I often use Spark for ETL-type tasks, where the input is a large file on disk and the output is another large file on disk. I've loaded everything into HDFS, but I still need to produce plain files on the other side.
Right now I produce these processed files in a 2-step process:

1) In a single Spark job, read from HDFS location A, process, and write to HDFS location B.
2) Run hadoop fs -cat hdfs:///path/to/* > /path/to/myfile to get the result onto the local disk.

(A minimal sketch of this workflow is in the P.S. below.) It would be great to get this down to a 1-step process. If I run .saveAsTextFile("...") on my RDD, the shards of the file end up scattered across the local disks of the cluster. But if I .collect() on the driver and then save to disk with normal Scala disk IO utilities, I'll certainly OOM the driver.

*So the question*: is there a way to get an iterator over an RDD's contents that I can scan through on the driver and flush to disk? I found the RDD.iterator() method, but it looks to be intended for use by RDD subclasses rather than end users (it requires Partition and TaskContext parameters). The .foreach() method also executes on the workers rather than on the driver, so saving from there would likewise scatter files across the cluster.

Any suggestions?

Thanks!
Andrew
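
P.S. For context, here is a minimal sketch of the current 2-step workflow. The paths, app name, and the map() are placeholders standing in for my real job, not the actual code:

    // Step 1: a single Spark job -- read from HDFS location A, process,
    // and write to HDFS location B (one part-NNNNN shard per partition).
    import org.apache.spark.{SparkConf, SparkContext}

    object EtlJob {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("etl"))
        sc.textFile("hdfs:///path/to/input")        // HDFS location A
          .map(_.toUpperCase)                       // stand-in for the real processing
          .saveAsTextFile("hdfs:///path/to/output") // HDFS location B
        sc.stop()
      }
    }

    // Step 2: concatenate the shards onto the local disk of one machine:
    //   hadoop fs -cat 'hdfs:///path/to/output/part-*' > /path/to/myfile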