Hi Spark users,

I often use Spark for ETL-type tasks, where the input is a large file
on disk and the output is another large file on disk.  I've loaded
everything into HDFS, but I still need to produce plain files on the other side.

Right now I produce these processed files in a 2-step process:

1) in a single Spark job, read from HDFS location A, process, and write to
HDFS location B
2) run hadoop fs -cat hdfs:///path/to/* > /path/to/myfile to get it onto
the local disk.
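
Concretely, the two steps look roughly like this (assuming sc is the
SparkContext; the paths and the transform are just placeholders for my
real job):

    // Step 1: read from HDFS location A, process, write back to HDFS location B.
    val processed = sc.textFile("hdfs:///path/to/A")
      .map(line => transform(line))   // transform() stands in for the real per-record logic
    processed.saveAsTextFile("hdfs:///path/to/B")

    // Step 2, outside Spark: stitch the part files together onto local disk.
    //   hadoop fs -cat hdfs:///path/to/B/* > /path/to/myfile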

It would be great to get this down to a 1-step process.

If I run .saveAsTextFile("...") on my RDD, the part files end up
scattered across the local disks of the workers.  But if I .collect() on
the driver and then save to disk using normal Scala I/O utilities, I'll
certainly OOM the driver.
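
For concreteness, this is the collect() version I want to avoid, because
it materializes the entire dataset in driver memory before anything is
written (rdd and the output path are placeholders):

    import java.io.PrintWriter

    // Every record is pulled into a single Array on the driver first -- fine for
    // small results, but a guaranteed OOM for file-sized data.
    val everything: Array[String] = rdd.collect()
    val out = new PrintWriter("/path/to/myfile")
    try {
      everything.foreach(line => out.println(line))
    } finally {
      out.close()
    }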

*So the question*: is there a way to get an iterator over an RDD so that I
can scan through its contents on the driver and stream them to disk?
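
Shape-wise, I'm picturing something like the sketch below, where
localIterator() is just a made-up name for whatever would hand the driver
records lazily, one partition at a time:

    import java.io.PrintWriter

    // localIterator() is hypothetical -- the idea is an Iterator[String] the driver
    // can consume lazily and stream straight to disk, never holding the whole RDD
    // in memory at once.
    val out = new PrintWriter("/path/to/myfile")
    try {
      rdd.localIterator().foreach(line => out.println(line))
    } finally {
      out.close()
    }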

I found the RDD.iterator() method, but it looks to be intended for use by
RDD subclasses rather than end users (it requires Partition and TaskContext
parameters).  The .foreach() method also executes on the workers rather
than on the driver, so it would likewise scatter files across the cluster
if I wrote them from there.
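
To illustrate the .foreach() problem: something like this runs the closure
on the executors, so every worker writes to its own local disk and I'm back
to scattered pieces (the path is a placeholder):

    // This closure executes on the workers, not the driver: each machine appends
    // to a file on its own local filesystem.
    rdd.foreach { line =>
      val w = new java.io.FileWriter("/tmp/myfile", true)
      try {
        w.write(line + "\n")
      } finally {
        w.close()
      }
    }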

Any suggestions?

Thanks!
Andrew
