Thanks for the tip Matei!
On Wed, Sep 11, 2013 at 4:12 AM, <[email protected]> wrote:
>
> Hi Nicholas,
>
> Right now the best way to do this is probably to run foreach() on each value
> and then use the Hadoop FileSystem API directly to write a file. It has a
> pretty simple API based on OutputStreams:
> http://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/fs/FileSystem.html.
> You just have to call FileSystem.get(URI, Configuration) and then call
> create() on it to write a file. You may want to put the file into a temp
> location first and only rename it to the final name after the task is
> successful to deal well with task failures.
>
> Matei
>
> On Sep 10, 2013, at 10:16 PM, Nicholas Pritchard
> <[email protected]> wrote:
>
> > Hi,
> >
> > I have an RDD of (Key, Value) pairs that I would like to save to HDFS.
> > However, rather than putting everything into one file, I would like to
> > split the RDD by key and save each part as a separate file. The key would
> > become the filename.
> >
> > In short, I am trying to do something like this:
> >
> >     myRDD.groupByKey().foreach { case (key, values) =>
> >       values.saveAsTextFile(key)
> >     }
> >
> > This obviously doesn't work since values is of type Seq[V] instead of
> > RDD[V], but does anyone have any suggestions for doing this efficiently?
> > Currently, I am repeatedly filtering and saving the RDD, but this seems
> > inefficient.
> >
> > Thanks,
> > Nick
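For anyone finding this thread later, Matei's suggestion can be sketched roughly like this. This is only an illustration, not code from the thread: `myRDD` and the `outputDir` base path are hypothetical, and it assumes the Hadoop 1.x `FileSystem` API linked above. Each task writes its key's values to a temp file via `FileSystem.create()`, then renames it to the final name so a failed or retried task never leaves a partially written file:

```scala
import java.net.URI

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val outputDir = "hdfs://namenode:8020/output"  // hypothetical base path

myRDD.groupByKey().foreach { case (key, values) =>
  val fs = FileSystem.get(new URI(outputDir), new Configuration())

  // Write to a temp location first, rename only on success,
  // as Matei suggests, to deal well with task failures.
  val tmp = new Path(outputDir, "_tmp_" + key.toString)
  val dst = new Path(outputDir, key.toString)

  val out = fs.create(tmp, true)  // overwrite any leftover temp file
  try {
    values.foreach(v => out.write((v.toString + "\n").getBytes("UTF-8")))
  } finally {
    out.close()
  }
  fs.rename(tmp, dst)
}
```

Note that `groupByKey()` pulls all values for a key into one task's memory, so this sketch only works when each key's value list fits on a single executor.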
