This was covered a few days ago:

http://apache-spark-user-list.1001560.n3.nabble.com/How-to-write-a-RDD-into-One-Local-Existing-File-td16720.html

Multiple output files are actually essential for parallelism, and
certainly not a bad idea. You don't want 100 distributed workers
writing to 1 file in 1 place, not if you want it to be fast.

RDD and JavaRDD already expose a method to iterate over the data,
called toLocalIterator. It does not require that the RDD fit entirely
in memory; only one partition at a time needs to fit on the driver.
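
For example, a minimal sketch in Java (writeToLocalFile is an
illustrative helper name, not part of the Spark API):

    import java.io.IOException;
    import java.io.PrintWriter;
    import java.util.Iterator;
    import org.apache.spark.api.java.JavaRDD;

    // Illustrative helper: streams an RDD's elements into one local file,
    // one line per element. toLocalIterator() fetches partitions lazily,
    // so only one partition is held on the driver at a time.
    public static void writeToLocalFile(JavaRDD<String> rdd, String path)
            throws IOException {
        try (PrintWriter out = new PrintWriter(path)) {
            Iterator<String> it = rdd.toLocalIterator();
            while (it.hasNext()) {
                out.println(it.next());
            }
        }
    }

If the strings need to come out in a particular order, sort the RDD
first (e.g. with sortBy); toLocalIterator walks the partitions in
order, so a sorted RDD yields sorted output.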

On Mon, Oct 20, 2014 at 6:13 PM, Steve Lewis <lordjoe2...@gmail.com> wrote:
>   At the end of a set of computation I have a JavaRDD<String> . I want a
> single file where each string is printed in order. The data is small enough
> that it is acceptable to handle the printout on a single processor. It may
> be large enough that using collect to generate a list might be unacceptable.
> the saveAsText command creates multiple files with names like part0000,
> part0001 .... This was bed behavior in Hadoop for final output and is also
> bad for Spark.
>   A more general issue is whether is it possible to convert a JavaRDD into
> an iterator or iterable over then entire data set without using collect or
> holding all data in memory.
>    In many problems where it is desirable to parallelize intermediate steps
> but use a single process for handling the final result this could be very
> useful.
