Hey Steve - the way to do this is to use the coalesce() function to collapse your RDD into a single partition. Then you can do a saveAsTextFile and you'll wind up with outputDir/part-00000 containing all the data.
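
Something like this (a minimal sketch - the class name, the sample data, and the "outputDir" path are just placeholders for whatever your job actually produces):

    import java.util.Arrays;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class SingleFileOutput {
        public static void main(String[] args) {
            JavaSparkContext sc =
                new JavaSparkContext(new SparkConf().setAppName("SingleFileOutput"));

            // Placeholder data standing in for the real computation.
            JavaRDD<String> results =
                sc.parallelize(Arrays.asList("first line", "second line", "third line"));

            // coalesce(1) collapses the RDD into a single partition, so
            // saveAsTextFile writes exactly one part file. All the output
            // flows through one task, so only do this when the final result
            // is small enough for a single worker to handle.
            results.coalesce(1).saveAsTextFile("outputDir");
            // -> outputDir/part-00000 now contains all the data.

            sc.stop();
        }
    }
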
-Ilya Ganelin

On Mon, Oct 20, 2014 at 11:01 PM, jay vyas <jayunit100.apa...@gmail.com> wrote:
> sounds more like a use case for using "collect"... and writing out the
> file in your program?
>
> On Mon, Oct 20, 2014 at 6:53 PM, Steve Lewis <lordjoe2...@gmail.com> wrote:
>> Sorry I missed the discussion - although it did not answer the question -
>> In my case (and I suspect the asker's) the 100 slaves are doing a lot of
>> useful work, but the generated output is small enough to be handled by a
>> single process.
>> Many of the large data problems I have worked on process a lot of data but
>> end up with a single report file - frequently in a format specified by
>> preexisting downstream code.
>> I do not want a separate Hadoop merge step, for a lot of reasons, starting
>> with better control of the generation of the file.
>> However, toLocalIterator is exactly what I need.
>> Somewhat off topic - I am being overwhelmed by getting a lot of emails
>> from the list - is there a way to get a daily summary which might be a lot
>> easier to keep up with?
>>
>> On Mon, Oct 20, 2014 at 3:23 PM, Sean Owen <so...@cloudera.com> wrote:
>>> This was covered a few days ago:
>>>
>>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-write-a-RDD-into-One-Local-Existing-File-td16720.html
>>>
>>> The multiple output files are actually essential for parallelism, and
>>> certainly not a bad idea. You don't want 100 distributed workers
>>> writing to 1 file in 1 place, not if you want it to be fast.
>>>
>>> RDD and JavaRDD already expose a method to iterate over the data,
>>> called toLocalIterator. It does not require that the RDD fit entirely
>>> in memory.
>>>
>>> On Mon, Oct 20, 2014 at 6:13 PM, Steve Lewis <lordjoe2...@gmail.com> wrote:
>>>> At the end of a set of computations I have a JavaRDD<String>. I want a
>>>> single file where each string is printed in order. The data is small enough
>>>> that it is acceptable to handle the printout on a single processor. It may
>>>> be large enough that using collect to generate a list might be unacceptable.
>>>> The saveAsText command creates multiple files with names like part-00000,
>>>> part-00001 .... This was bad behavior in Hadoop for final output and is also
>>>> bad for Spark.
>>>> A more general issue is whether it is possible to convert a JavaRDD into
>>>> an iterator or iterable over the entire data set without using collect or
>>>> holding all data in memory.
>>>> In many problems where it is desirable to parallelize intermediate steps
>>>> but use a single process for handling the final result, this could be very
>>>> useful.
>>>
>>
>> --
>> Steven M. Lewis PhD
>> 4221 105th Ave NE
>> Kirkland, WA 98033
>> 206-384-1340 (cell)
>> Skype lordjoe_com
>
>
> --
> jay vyas
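
P.S. For what it's worth, here is a rough sketch of the toLocalIterator approach Sean and Steve mention above - the sample data and the "report.txt" file name are made up for illustration:

    import java.io.PrintWriter;
    import java.util.Arrays;
    import java.util.Iterator;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class LocalIteratorReport {
        public static void main(String[] args) throws Exception {
            JavaSparkContext sc =
                new JavaSparkContext(new SparkConf().setAppName("LocalIteratorReport"));

            // Placeholder results standing in for the real computation.
            JavaRDD<String> results =
                sc.parallelize(Arrays.asList("line 1", "line 2", "line 3"));

            // toLocalIterator() streams the data back to the driver one
            // partition at a time, so the whole RDD never has to fit in
            // driver memory the way collect() requires.
            try (PrintWriter out = new PrintWriter("report.txt", "UTF-8")) {
                Iterator<String> it = results.toLocalIterator();
                while (it.hasNext()) {
                    out.println(it.next());
                }
            }

            sc.stop();
        }
    }
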