AFAIK there is no direct equivalent in Spark. You can cache or persist an RDD and then run N separate operations to output different things from it, which is pretty close.
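For example, a rough sketch of that cache-and-filter approach (Scala API, whereas your example below is PySpark; the key function and output paths are just placeholders):

// Sketch only: persist once, then run one save per distinct key.
val pairs = sc.parallelize(Seq("Nick", "Nancy", "Bob", "Ben", "Frankie"))
  .keyBy(_.take(1))
pairs.persist()

val keys = pairs.keys.distinct().collect()
for (k <- keys) {
  pairs.filter { case (key, _) => key == k }
    .values
    .saveAsTextFile(s"/path/prefix/$k")
}

pairs.unpersist()

The obvious downside is that it makes one pass over the cached data per distinct key.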
I think you might be able to get this working with a subclass of
MultipleTextOutputFormat, which overrides generateFileNameForKeyValue,
generateActualKey, etc. A bit of work for sure, but it probably works
(a rough sketch is at the bottom of this message).

Finally, I wonder if you can get away with the fact that 1 partition
generally == 1 file, and shuffle your data into the right partitions at
the end so that each key's records end up output together in a file (or
group of files).

On Sat, Sep 13, 2014 at 6:25 PM, Nick Chammas
<nicholas.cham...@gmail.com> wrote:
> Howdy doody Spark Users,
>
> I'd like to somehow write out a single RDD to multiple paths in one go.
> Here's an example.
>
> I have an RDD of (key, value) pairs like this:
>
> >>> a = sc.parallelize(['Nick', 'Nancy', 'Bob', 'Ben',
> >>> 'Frankie']).keyBy(lambda x: x[0])
> >>> a.collect()
> [('N', 'Nick'), ('N', 'Nancy'), ('B', 'Bob'), ('B', 'Ben'), ('F',
> 'Frankie')]
>
> Now I want to write the RDD out to different paths depending on the
> keys, so that I have one output directory per distinct key. Each output
> directory could potentially have multiple part- files or whatever.
>
> So my output would be something like:
>
> /path/prefix/n [/part-1, /part-2, etc]
> /path/prefix/b [/part-1, /part-2, etc]
> /path/prefix/f [/part-1, /part-2, etc]
>
> How would you do that?
>
> I suspect I need to use saveAsNewAPIHadoopFile or saveAsHadoopFile
> along with the MultipleTextOutputFormat output format class, but I'm
> not sure how.
>
> By the way, there is a very similar question to this here on Stack
> Overflow.
>
> Nick
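Re the MultipleTextOutputFormat idea above, here is a rough, untested
sketch of what the subclass might look like (Scala, old mapred API;
the class name KeyBasedOutput is made up, and you'd want to sanity-check
the types against your own RDD):

import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

// Routes each (key, value) pair to a subdirectory named after the key.
class KeyBasedOutput extends MultipleTextOutputFormat[Any, Any] {
  // Don't write the key into the output records themselves.
  override def generateActualKey(key: Any, value: Any): Any =
    NullWritable.get()

  // Directory per key: e.g. /path/prefix/N/part-00000
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    key.toString + "/" + name
}

// Then something along the lines of:
//   pairs.saveAsHadoopFile("/path/prefix", classOf[String], classOf[String],
//     classOf[KeyBasedOutput])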