Re: Partition + equivalent of MapReduce multiple outputs

2015-01-28 Thread Corey Nolet
I think the repartitionAndSortWithinPartitions() method in [1] may be what I'm looking for. At least it sounds like it is. Will this method allow me to deal with sorted partitions even when a partition doesn't fit into memory? [1] https://github.com/apache/spark/blob/branch-1.2/core/src/main/sc
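[Editor's note: a minimal sketch of how repartitionAndSortWithinPartitions() is called, assuming Spark 1.2's pair-RDD API (the implicit conversion needs the SparkContext._ import on that version). The sort runs inside the shuffle machinery, which is designed to spill to disk, so the per-partition iterator comes back in key order without the partition being materialized in memory:]

    import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
    import org.apache.spark.SparkContext._ // pair-RDD implicits on 1.2

    val sc = new SparkContext(new SparkConf().setAppName("sort-sketch").setMaster("local[*]"))
    val pairs = sc.parallelize(Seq(("b", 2), ("a", 1), ("c", 3)))

    // Shuffle into 4 partitions, sorting by key within each partition.
    val sorted = pairs.repartitionAndSortWithinPartitions(new HashPartitioner(4))

    // Each partition's iterator arrives already sorted; consume it lazily.
    sorted.mapPartitions(iter => iter.map { case (k, v) => s"$k=$v" }).foreach(println)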

Re: Partition + equivalent of MapReduce multiple outputs

2015-01-28 Thread Corey Nolet
I'm looking at the ShuffledRDD code and it looks like there is a method setKeyOrdering(). Is this guaranteed to order everything in the partition? I'm on Spark 1.2.0 On Wed, Jan 28, 2015 at 9:07 AM, Corey Nolet wrote: > In all of the solutions I've found thus far, sorting has been done by collecting > the
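[Editor's note: in the 1.2 branch, setKeyOrdering() is the hook that repartitionAndSortWithinPartitions() sets under the hood; wiring the ShuffledRDD by hand looks roughly like the sketch below. ShuffledRDD is a developer API, so the higher-level method above is the safer route:]

    import org.apache.spark.HashPartitioner
    import org.apache.spark.rdd.ShuffledRDD

    val pairs = sc.parallelize(Seq(("b", 2), ("a", 1), ("c", 3)))

    // With a key ordering set, the shuffle read side merge-sorts the
    // fetched blocks, so the whole partition comes out in key order.
    val shuffled = new ShuffledRDD[String, Int, Int](pairs, new HashPartitioner(4))
      .setKeyOrdering(implicitly[Ordering[String]])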

Re: Partition + equivalent of MapReduce multiple outputs

2015-01-28 Thread Corey Nolet
In all of the solutions I've found thus far, sorting has been done by collecting the partition iterator into an array and sorting the array. This is not going to work for my case, as the amount of data in each partition may not fit into memory. Any ideas? On Wed, Jan 28, 2015 at 1:29 AM, Corey N
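[Editor's note: for contrast, this is roughly the pattern being ruled out here, sketched hypothetically:]

    // The approach that doesn't scale: materializing the whole partition
    // on the heap just to sort it. Memory use is O(partition size).
    rdd.mapPartitions { iter =>
      iter.toArray.sortBy(_._1).iterator
    }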

Re: Partition + equivalent of MapReduce multiple outputs

2015-01-27 Thread Corey Nolet
I wanted to update this thread for others who may be looking for a solution to this as well. I found [1] and I'm going to investigate if it is a viable solution. [1] http://stackoverflow.com/questions/23995040/write-to-multiple-outputs-by-key-spark-one-spark-job On Wed, Jan 28, 2015 at 12:51 AM,
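[Editor's note: the gist of the linked answer, roughly, is to subclass Hadoop's MultipleTextOutputFormat and write through saveAsHadoopFile; the class name and output path below are illustrative:]

    import org.apache.hadoop.io.NullWritable
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
    import org.apache.spark.HashPartitioner
    import org.apache.spark.SparkContext._ // pair-RDD implicits on 1.2

    // Routes each record into a subdirectory named after its key.
    class RDDMultipleTextOutputFormat extends MultipleTextOutputFormat[Any, Any] {
      override def generateActualKey(key: Any, value: Any): Any =
        NullWritable.get() // keep only the value in the file contents
      override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
        key.asInstanceOf[String] + "/" + name // e.g. /out/<key>/part-00000
    }

    val byGroup = sc.parallelize(Seq(("groupA", "rec1"), ("groupB", "rec2")))
    byGroup.partitionBy(new HashPartitioner(4))
      .saveAsHadoopFile("/out", classOf[String], classOf[String],
        classOf[RDDMultipleTextOutputFormat])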

Partition + equivalent of MapReduce multiple outputs

2015-01-27 Thread Corey Nolet
I need to be able to take an input RDD[Map[String,Any]] and split it into several different RDDs based on some partitionable piece of the key (groups) and then send each partition to a separate set of files in different folders in HDFS. 1) Would running the RDD through a custom partitioner be the
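[Editor's note: on (1), a custom partitioner over the group portion of the key is straightforward to sketch, assuming the set of groups is known up front; names here are hypothetical:]

    import org.apache.spark.Partitioner

    // One partition per group, so each group's records can be written
    // to its own HDFS folder from a single pass over the partitions.
    class GroupPartitioner(groups: Seq[String]) extends Partitioner {
      private val index = groups.zipWithIndex.toMap
      override def numPartitions: Int = groups.size
      override def getPartition(key: Any): Int = index(key.asInstanceOf[String])
    }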