collect() stores the entire output in a list in driver memory. That is
acceptable for "Little Data" problems, although if the entire problem fits
in the memory of a single machine there is less motivation to use Spark.

Most problems which benefit from Spark are large enough that even the data
assigned to a single partition will not fit into memory.

In my case the output is currently in the 0.5 - 4 GB range, but in the
future it might grow to 4 times that size - something a single machine
could write but not hold in memory at one time. I also find that for most
problems a file like part-00001 is not what the next step wants to
consume - and the minute a step is required to further process that file,
even just to move and rename it, there is little reason not to let the
Spark code write what is wanted in the first place.

I like the solution of using toLocalIterator and writing my own file.
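A minimal sketch of that approach, assuming a PySpark setup with an existing SparkContext named `sc` (the variable name and the `write_records` helper are my own illustration, not part of any Spark API). toLocalIterator pulls partitions to the driver one at a time, so only one partition's worth of data is ever held in driver memory while the single output file is written:

```python
def write_records(records, path):
    """Write an iterable of records to `path`, one per line.

    Works with any iterator, including the one returned by
    rdd.toLocalIterator(), so the full dataset never has to
    fit in driver memory the way collect() requires.
    """
    count = 0
    with open(path, "w") as out:
        for rec in records:
            out.write(f"{rec}\n")
            count += 1
    return count

# With a real SparkContext `sc`, usage would look like:
#   rdd = sc.parallelize(range(10 ** 6))
#   write_records(rdd.toLocalIterator(), "output.txt")
```

The trade-off is that the write is serialized through the driver, so it is slower than a parallel saveAsTextFile, but you get exactly the single file you wanted with bounded memory use.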
