Yes, saveAsTextFile() writes one part file per RDD partition. When you
coalesce(1), you move everything in the RDD into a single partition, which
then gives you a single output file.
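For example (a minimal sketch; the bucket and paths here are placeholders):

    // coalesce to a single partition before writing, so Spark emits
    // exactly one part file under the output "directory"
    val lines = sc.textFile("s3n://bucket/2014-04-28/")
    lines.coalesce(1).saveAsTextFile("s3n://bucket/concatted")

Keep in mind that coalesce(1) funnels the whole dataset through a single
task, so this only works when the result fits comfortably on one worker.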

It will still be called part-00000 or something like that, because that
naming is defined by the Hadoop API that Spark uses for reading from and
writing to S3. I don't know of a way to change the name Spark writes.
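If you need a specific object name, one workaround (an untested sketch on
my part; the paths are placeholders) is to rename the single part file
after the job finishes, using the Hadoop FileSystem API:

    import java.net.URI
    import org.apache.hadoop.fs.{FileSystem, Path}

    // rename part-00000 to the desired key; note that on s3n a
    // "rename" is really a copy followed by a delete
    val fs = FileSystem.get(new URI("s3n://bucket"), sc.hadoopConfiguration)
    fs.rename(new Path("s3n://bucket/concatted.csv/part-00000"),
              new Path("s3n://bucket/concatted-final.csv"))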


On Wed, Apr 30, 2014 at 2:47 PM, Peter <thenephili...@yahoo.com> wrote:

> Ah, looks like RDD.coalesce(1) solves one part of the problem.
>   On Wednesday, April 30, 2014 11:15 AM, Peter <thenephili...@yahoo.com>
> wrote:
>  Hi
>
> Playing around with Spark & S3, I'm opening multiple objects (CSV files)
> with:
>
>     val hfile = sc.textFile("s3n://bucket/2014-04-28/")
>
> so hfile is an RDD representing the 10 objects that were "underneath"
> 2014-04-28. After I've sorted and otherwise transformed the content, I'm
> trying to write it back to a single object:
>
>
> sortedMap.values.map(_.mkString(",")).saveAsTextFile("s3n://bucket/concatted.csv")
>
> unfortunately this results in a "folder" named concatted.csv with 10
> objects underneath, part-00000 .. part-00009, corresponding to the 10
> original objects loaded.
>
> How can I achieve the desired behaviour of writing a single object named
> concatted.csv?
>
> I've tried 0.9.1 and 1.0.0-RC3.
>
> Thanks!
> Peter
