I wanted to update this thread for others who may be looking for a solution
to this as well. I found [1] and I'm going to investigate whether it's a
viable solution.

[1]
http://stackoverflow.com/questions/23995040/write-to-multiple-outputs-by-key-spark-one-spark-job
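
For anyone following along, here's roughly what the approach in [1] looks
like: subclass MultipleTextOutputFormat so a single saveAsHadoopFile() call
routes each key's records into its own directory, and use
repartitionAndSortWithinPartitions (new in 1.2) to satisfy the sorted-order
requirement without buffering all keys in memory. This is just an untested
sketch; the paths and extractGroup() below are placeholders:

import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.SparkContext._

// Writes each record under a directory named after its key, e.g.
// output/path/<group>/part-00000, from a single saveAsHadoopFile() call.
class RDDMultipleTextOutputFormat extends MultipleTextOutputFormat[Any, Any] {
  // Drop the key from the output records themselves.
  override def generateActualKey(key: Any, value: Any): Any =
    NullWritable.get()

  // Use the key as a subdirectory; "name" is the usual part-NNNNN file.
  override def generateFileNameForKeyValue(key: Any, value: Any,
                                           name: String): String =
    key.asInstanceOf[String] + "/" + name
}

object SplitByGroup {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("SplitByGroup"))
    sc.textFile("input/path")                   // placeholder input
      .map(line => (extractGroup(line), line))  // extractGroup is your own
      .repartitionAndSortWithinPartitions(new HashPartitioner(100))
      .saveAsHadoopFile("output/path", classOf[String], classOf[String],
        classOf[RDDMultipleTextOutputFormat])
    sc.stop()
  }

  // Placeholder for however you derive the group from a record.
  def extractGroup(line: String): String = line.split('\t')(0)
}

Note that MultipleTextOutputFormat is from the old mapred API, so this goes
through saveAsHadoopFile() rather than saveAsNewAPIHadoopFile().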

On Wed, Jan 28, 2015 at 12:51 AM, Corey Nolet <cjno...@gmail.com> wrote:

> I need to be able to take an input RDD[Map[String,Any]] and split it into
> several different RDDs based on some partitionable piece of the key
> (groups) and then send each partition to a separate set of files in
> different folders in HDFS.
>
> 1) Would running the RDD through a custom partitioner be the best way to
> go about this, or should I split the RDD into different RDDs and call
> saveAsHadoopFile() on each?
> 2) I need the resulting partitions sorted by key; they also need to be
> written to the underlying files in sorted order.
> 3) The number of keys in each partition will almost always be too big to
> fit into memory.
>
> Thanks.
>
