Sounds more like a use case for "collect"... and writing out the file in your program?
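
For what it's worth, a minimal sketch of the "write it yourself on the driver" idea (the class and method names below are just illustrative): collect() is fine when the result comfortably fits in driver memory, while toLocalIterator() - mentioned downthread - streams one partition at a time instead:

    import java.io.BufferedWriter;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.spark.api.java.JavaRDD;

    public class SingleFileWriter {

        // Illustrative helper: print every string in the RDD to one local file on
        // the driver. toLocalIterator() pulls one partition at a time, so the whole
        // RDD never has to fit in driver memory the way it would with collect().
        public static void writeToLocalFile(JavaRDD<String> rdd, String path)
                throws IOException {
            try (BufferedWriter writer = new BufferedWriter(new FileWriter(path))) {
                Iterator<String> it = rdd.toLocalIterator();
                while (it.hasNext()) {
                    writer.write(it.next());
                    writer.newLine();
                }
            }
        }
    }

Iteration follows the RDD's partition order, so if the RDD is already sorted the output file preserves that order.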
On Mon, Oct 20, 2014 at 6:53 PM, Steve Lewis <lordjoe2...@gmail.com> wrote:
> Sorry I missed the discussion - although it did not answer the question -
> In my case (and I suspect the asker's) the 100 slaves are doing a lot of
> useful work, but the generated output is small enough to be handled by a
> single process.
> Many of the large data problems I have worked on process a lot of data but
> end up with a single report file - frequently in a format specified by
> preexisting downstream code.
> I do not want a separate Hadoop merge step, for a lot of reasons, starting
> with better control of the generation of the file.
> However, toLocalIterator is exactly what I need.
> Somewhat off topic - I am being overwhelmed by the volume of email from
> the list - is there a way to get a daily summary, which might be a lot
> easier to keep up with?
>
>
> On Mon, Oct 20, 2014 at 3:23 PM, Sean Owen <so...@cloudera.com> wrote:
>
>> This was covered a few days ago:
>>
>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-write-a-RDD-into-One-Local-Existing-File-td16720.html
>>
>> The multiple output files are actually essential for parallelism, and
>> certainly not a bad idea. You don't want 100 distributed workers
>> writing to 1 file in 1 place, not if you want it to be fast.
>>
>> RDD and JavaRDD already expose a method to iterate over the data,
>> called toLocalIterator. It does not require that the RDD fit entirely
>> in memory.
>>
>> On Mon, Oct 20, 2014 at 6:13 PM, Steve Lewis <lordjoe2...@gmail.com> wrote:
>> > At the end of a set of computations I have a JavaRDD<String>. I want a
>> > single file where each string is printed in order. The data is small
>> > enough that it is acceptable to handle the printout on a single
>> > processor. It may be large enough that using collect to generate a list
>> > might be unacceptable.
>> > The saveAsTextFile command creates multiple files with names like
>> > part0000, part0001... This was bad behavior in Hadoop for final output
>> > and is also bad for Spark.
>> > A more general issue is whether it is possible to convert a JavaRDD
>> > into an iterator or iterable over the entire data set without using
>> > collect or holding all data in memory.
>> > In many problems where it is desirable to parallelize intermediate
>> > steps but use a single process for handling the final result, this
>> > could be very useful.
>
>
>
> --
> Steven M. Lewis PhD
> 4221 105th Ave NE
> Kirkland, WA 98033
> 206-384-1340 (cell)
> Skype lordjoe_com
>

--
jay vyas