AFAIK there is no direct equivalent in Spark. You can cache or persist
an RDD, and then run N separate operations to output different things
from it, which is pretty close.
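
For example, a rough, untested sketch in spark-shell / Scala (I'm
borrowing your /path/prefix layout, and using Scala rather than PySpark
since the OutputFormat idea below needs JVM code anyway):

val pairs = sc.parallelize(Seq("Nick", "Nancy", "Bob", "Ben", "Frankie"))
  .keyBy(_.substring(0, 1))

// Cache so the N output passes below don't recompute the RDD each time
pairs.cache()

// One filter + save per distinct key: N jobs, one computation of `pairs`
for (key <- pairs.keys.distinct().collect()) {
  pairs.filter { case (k, _) => k == key }
    .values
    .saveAsTextFile(s"/path/prefix/${key.toLowerCase}")
}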

I think you might be able to get this working with a subclass of
MultipleTextOutputFormat that overrides generateFileNameForKeyValue,
generateActualKey, etc. It's a bit of work for sure, but it probably works.
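
Untested, but roughly along these lines (the class name is just a
placeholder, `pairs` is the keyed RDD from above, and the subclass has
to live on the JVM side and be on your executors' classpath):

import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

// Routes each (key, value) record into a subdirectory named after its key
class RDDKeyOutputFormat extends MultipleTextOutputFormat[Any, Any] {

  // `name` is the usual part-NNNNN leaf file name for the partition
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    key.toString.toLowerCase + "/" + name

  // Emit a null key so only the value appears in the output lines
  override def generateActualKey(key: Any, value: Any): Any =
    NullWritable.get()
}

// Then, instead of saveAsTextFile:
pairs.saveAsHadoopFile("/path/prefix", classOf[String], classOf[String],
  classOf[RDDKeyOutputFormat])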

Finally, I wonder if you can take advantage of the fact that 1 partition
generally == 1 output file, and shuffle your data into the right partitions
at the end so that records for the same key come out together in a file
(or group of files).
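
That is, assign each distinct key its own partition index before saving.
Another untested sketch (this one groups keys into separate part- files
under a single output directory; '/path/prefix/all' is a made-up path):

import org.apache.spark.Partitioner

// Map each distinct key to its own partition index, then shuffle so that
// every partition -- and hence every output part- file -- holds one key
val keys = pairs.keys.distinct().collect().sorted
val indexOf = keys.zipWithIndex.toMap

val byKey = new Partitioner {
  override def numPartitions: Int = keys.length
  override def getPartition(key: Any): Int = indexOf(key.asInstanceOf[String])
}

pairs.partitionBy(byKey)
  .values
  .saveAsTextFile("/path/prefix/all")  // part-00000 = 'B', part-00001 = 'F', part-00002 = 'N'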

On Sat, Sep 13, 2014 at 6:25 PM, Nick Chammas
<nicholas.cham...@gmail.com> wrote:
> Howdy doody Spark Users,
>
> I’d like to somehow write out a single RDD to multiple paths in one go.
> Here’s an example.
>
> I have an RDD of (key, value) pairs like this:
>
>>>> a = sc.parallelize(['Nick', 'Nancy', 'Bob', 'Ben',
>>>> 'Frankie']).keyBy(lambda x: x[0])
>>>> a.collect()
> [('N', 'Nick'), ('N', 'Nancy'), ('B', 'Bob'), ('B', 'Ben'), ('F',
> 'Frankie')]
>
> Now I want to write the RDD out to different paths depending on the keys, so
> that I have one output directory per distinct key. Each output directory
> could potentially have multiple part- files or whatever.
>
> So my output would be something like:
>
> /path/prefix/n [/part-1, /part-2, etc]
> /path/prefix/b [/part-1, /part-2, etc]
> /path/prefix/f [/part-1, /part-2, etc]
>
> How would you do that?
>
> I suspect I need to use saveAsNewAPIHadoopFile or saveAsHadoopFile along
> with the MultipleTextOutputFormat output format class, but I’m not sure how.
>
> By the way, there is a very similar question to this here on Stack Overflow.
>
> Nick
>
