This was covered a few days ago: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-write-a-RDD-into-One-Local-Existing-File-td16720.html
The multiple output files are actually essential for parallelism, and certainly not a bad idea: you don't want 100 distributed workers writing to 1 file in 1 place, not if you want it to be fast.

RDD and JavaRDD already expose a method to iterate over the data, called toLocalIterator. It does not require that the RDD fit entirely in memory.

On Mon, Oct 20, 2014 at 6:13 PM, Steve Lewis <lordjoe2...@gmail.com> wrote:
> At the end of a set of computations I have a JavaRDD<String>. I want a
> single file where each string is printed in order. The data is small enough
> that it is acceptable to handle the printout on a single processor. It may
> be large enough that using collect to generate a list might be unacceptable.
> The saveAsTextFile command creates multiple files with names like part-00000,
> part-00001 .... This was bad behavior in Hadoop for final output and is also
> bad for Spark.
> A more general issue is whether it is possible to convert a JavaRDD into
> an iterator or iterable over the entire data set without using collect or
> holding all data in memory.
> In many problems where it is desirable to parallelize intermediate steps
> but use a single process for handling the final result, this could be very
> useful.
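For concreteness, a minimal sketch of the toLocalIterator approach might look like the following. The output path "output.txt" and the helper class name are just illustrations, and this assumes any required ordering was already established upstream (e.g. via sortBy):

    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.Iterator;

    import org.apache.spark.api.java.JavaRDD;

    public class SingleFileWriter {

        // Streams the RDD's elements to one local file without collect().
        // "output.txt" is a placeholder path for this example.
        public static void writeToLocalFile(JavaRDD<String> rdd) throws IOException {
            try (BufferedWriter writer =
                     Files.newBufferedWriter(Paths.get("output.txt"), StandardCharsets.UTF_8)) {
                // toLocalIterator pulls one partition at a time back to the
                // driver, so only a single partition needs to fit in driver
                // memory rather than the whole data set.
                Iterator<String> it = rdd.toLocalIterator();
                while (it.hasNext()) {
                    writer.write(it.next());
                    writer.newLine();
                }
            }
        }
    }

The trade-off is that the driver consumes the partitions sequentially, so the final write is single-threaded, which is exactly what was asked for here; the parallel part of the job still runs distributed.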