Thanks for the tip Matei!
On Wed, Sep 11, 2013 at 4:12 AM, <[email protected]> wrote:
>
> Hi Nicholas,
>
> Right now the best way to do this is probably to run foreach() on each value
> and then use the Hadoop FileSystem API directly to write a file. It has a
> pretty simple API based on OutputStreams:
> http://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/fs/FileSystem.html.
> You just have to call FileSystem.get(URI, Configuration) and then call
> create() on it to write a file. You may want to put the file into a temp
> location first and only rename it to the final name after the task is
> successful to deal well with task failures.
>
> Matei
>
> On Sep 10, 2013, at 10:16 PM, Nicholas Pritchard
> <[email protected]> wrote:
>
> > Hi,
> >
> > I have an RDD of (Key, Value) pairs that I would like to save to HDFS.
> > However, rather than putting everything into one file, I would like to
> > split the RDD by key and save each part as a separate file. The key would
> > become the filename.
> >
> > In short, I am trying to do something like this:
> >
> >     myRDD.groupByKey().foreach { case (key, values) =>
> >       values.saveAsTextFile(key)
> >     }
> >
> > This obviously doesn't work since values is of type Seq[V] instead of
> > RDD[V], but does anyone have any suggestions for doing this efficiently?
> > Currently, I am repeatedly filtering and saving the RDD, but this seems
> > inefficient.
> >
> > Thanks,
> > Nick
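For anyone finding this thread later, Matei's suggestion can be sketched roughly like this. This is only an illustration, not code from the thread: `myRDD` and the `outputDir` base path are hypothetical, and it assumes the Hadoop 1.x `FileSystem` API linked above. Each task writes its key's values to a temp file via `FileSystem.create()`, then renames it to the final name so a failed or retried task never leaves a partially written file:

```scala
import java.net.URI

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val outputDir = "hdfs://namenode:8020/output"  // hypothetical base path

myRDD.groupByKey().foreach { case (key, values) =>
  val fs = FileSystem.get(new URI(outputDir), new Configuration())

  // Write to a temp location first, rename only on success,
  // as Matei suggests, to deal well with task failures.
  val tmp = new Path(outputDir, "_tmp_" + key.toString)
  val dst = new Path(outputDir, key.toString)

  val out = fs.create(tmp, true)  // overwrite any leftover temp file
  try {
    values.foreach(v => out.write((v.toString + "\n").getBytes("UTF-8")))
  } finally {
    out.close()
  }
  fs.rename(tmp, dst)
}
```

Note that `groupByKey()` pulls all values for a key into one task's memory, so this sketch only works when each key's value list fits on a single executor.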
