Sorry I missed the discussion, although it did not answer the question. In my case (and, I suspect, the asker's) the 100 slaves are doing a lot of useful work, but the generated output is small enough to be handled by a single process. Many of the large data problems I have worked on process a lot of data but end up with a single report file, frequently in a format specified by preexisting downstream code. I do not want a separate Hadoop merge step for a lot of reasons, starting with better control over the generation of the file. However, toLocalIterator is exactly what I need.

Somewhat off topic: I am being overwhelmed by the volume of emails from the list. Is there a way to get a daily summary, which might be a lot easier to keep up with?
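For the single-file output described above, here is a minimal sketch of what the driver-side writing could look like. It assumes only that JavaRDD<String>.toLocalIterator() hands back a java.util.Iterator<String>; the class name SingleFileWriter, the method writeLines, and the temp-file target are hypothetical names chosen for illustration. To keep the example runnable without Spark on the classpath, a plain list iterator stands in for the RDD iterator.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Iterator;

// Hypothetical helper: streams an iterator of strings into one file.
class SingleFileWriter {

    // Writes each string as one line, in iteration order, and returns the
    // number of lines written. The writer itself only ever holds the current
    // element, so the data set is never collected into a list.
    static long writeLines(Iterator<String> lines, Path target) throws IOException {
        long count = 0;
        try (BufferedWriter out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            while (lines.hasNext()) {
                out.write(lines.next());
                out.newLine();
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the Spark call (assumption: rdd is a JavaRDD<String>):
        //   Iterator<String> it = rdd.toLocalIterator();
        Iterator<String> it = Arrays.asList("first", "second", "third").iterator();

        // Hypothetical output location; a real job would use the report path.
        Path report = Files.createTempFile("report", ".txt");
        long written = writeLines(it, report);
        System.out.println("wrote " + written + " lines to " + report);
    }
}
```

With a real RDD, Spark pulls the data to the driver one partition at a time, so the driver needs roughly enough memory for the largest partition rather than for the whole data set. Because everything runs in a single process, the file format and ordering stay fully under the control of this one method.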
On Mon, Oct 20, 2014 at 3:23 PM, Sean Owen <so...@cloudera.com> wrote:

> This was covered a few days ago:
>
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-write-a-RDD-into-One-Local-Existing-File-td16720.html
>
> The multiple output files are actually essential for parallelism, and
> certainly not a bad idea. You don't want 100 distributed workers
> writing to 1 file in 1 place, not if you want it to be fast.
>
> RDD and JavaRDD already expose a method to iterate over the data,
> called toLocalIterator. It does not require that the RDD fit entirely
> in memory.
>
> On Mon, Oct 20, 2014 at 6:13 PM, Steve Lewis <lordjoe2...@gmail.com> wrote:
> > At the end of a set of computations I have a JavaRDD<String>. I want a
> > single file where each string is printed in order. The data is small
> > enough that it is acceptable to handle the printout on a single
> > processor. It may be large enough that using collect to generate a
> > list might be unacceptable.
> > The saveAsText command creates multiple files with names like
> > part0000, part0001 .... This was bad behavior in Hadoop for final
> > output and is also bad for Spark.
> > A more general issue is whether it is possible to convert a JavaRDD
> > into an iterator or iterable over the entire data set without using
> > collect or holding all data in memory.
> > In many problems where it is desirable to parallelize intermediate
> > steps but use a single process for handling the final result this
> > could be very useful.

--
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com