Sounds more like a use case for "collect"... and writing out the file
in your program?
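
Something like this, roughly (an untested sketch; assumes the data fits in
driver memory, and a JavaRDD<String> named lines):

    import java.io.PrintWriter;
    import java.util.List;
    import org.apache.spark.api.java.JavaRDD;

    // Untested sketch: collect the whole RDD to the driver, then write one
    // local file. Only safe when the data comfortably fits in driver memory.
    void writeCollected(JavaRDD<String> lines, String path) throws Exception {
        List<String> all = lines.collect();
        try (PrintWriter out = new PrintWriter(path, "UTF-8")) {
            for (String line : all) {
                out.println(line);
            }
        }
    }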

On Mon, Oct 20, 2014 at 6:53 PM, Steve Lewis <lordjoe2...@gmail.com> wrote:

> Sorry I missed the discussion - although it did not answer the question -
> In my case (and, I suspect, the asker's) the 100 slaves are doing a lot of
> useful work, but the generated output is small enough to be handled by a
> single process.
> Many of the large data problems I have worked on process a lot of data but
> end up with a single report file - frequently in a format specified by
> preexisting downstream code.
>   I do not want a separate Hadoop merge step, for a lot of reasons,
> starting with better control of how the file is generated.
> However, toLocalIterator is exactly what I need.
> Somewhat off topic - I am being overwhelmed by the volume of email from
> the list - is there a way to get a daily digest, which might be a lot
> easier to keep up with?
>
>
> On Mon, Oct 20, 2014 at 3:23 PM, Sean Owen <so...@cloudera.com> wrote:
>
>> This was covered a few days ago:
>>
>>
>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-write-a-RDD-into-One-Local-Existing-File-td16720.html
>>
>> The multiple output files is actually essential for parallelism, and
>> certainly not a bad idea. You don't want 100 distributed workers
>> writing to 1 file in 1 place, not if you want it to be fast.
>>
>> RDD and JavaRDD already expose a method to iterate over the data,
>> called toLocalIterator. It does not require that the RDD fit entirely
>> in memory.
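>>
>> For example, something like this (a rough, untested sketch; assumes a
>> JavaRDD<String> named lines and an output path of your choosing):
>>
>>     import java.io.PrintWriter;
>>     import java.util.Iterator;
>>     import org.apache.spark.api.java.JavaRDD;
>>
>>     // Untested sketch: stream the RDD's elements through the driver one
>>     // partition at a time, so the whole data set never has to fit in
>>     // driver memory at once.
>>     void writeStreamed(JavaRDD<String> lines, String path) throws Exception {
>>         try (PrintWriter out = new PrintWriter(path, "UTF-8")) {
>>             Iterator<String> it = lines.toLocalIterator();
>>             while (it.hasNext()) {
>>                 out.println(it.next());
>>             }
>>         }
>>     }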
>>
>> On Mon, Oct 20, 2014 at 6:13 PM, Steve Lewis <lordjoe2...@gmail.com>
>> wrote:
>> >   At the end of a set of computations I have a JavaRDD<String>. I want
>> > a single file where each string is printed in order. The data is small
>> > enough that it is acceptable to handle the printout on a single
>> > processor. It may be large enough that using collect to generate a list
>> > is unacceptable.
>> >   The saveAsTextFile command creates multiple files with names like
>> > part-00000, part-00001 ... This was bad behavior in Hadoop for final
>> > output and is also bad for Spark.
>> >   A more general issue is whether it is possible to convert a JavaRDD
>> > into an iterator or iterable over the entire data set without using
>> > collect or holding all data in memory.
>> >   In many problems where it is desirable to parallelize intermediate
>> > steps but use a single process for handling the final result, this
>> > could be very useful.
>>
>
>
>
> --
> Steven M. Lewis PhD
> 4221 105th Ave NE
> Kirkland, WA 98033
> 206-384-1340 (cell)
> Skype lordjoe_com
>
>


-- 
jay vyas
