I think this repartitionAndSortWithinPartitions() method may be what I'm
looking for in [1]. At least it sounds like it is. Will this method allow
me to deal with sorted partitions even when the partition doesn't fit into
memory?
[1]
https://github.com/apache/spark/blob/branch-1.2/core/src/main/sc
I'm looking at the ShuffledRDD code and it looks like there is a method
setKeyOrdering() - is this guaranteed to order everything in the partition?
I'm on Spark 1.2.0
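
For reference, here's a minimal sketch of what I mean (the data and app name are just illustrative, and this assumes a local SparkContext). My understanding is that repartitionAndSortWithinPartitions sorts keys as part of the shuffle itself, so the sort can spill to disk rather than buffering a whole partition in memory:

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits on 1.2

// Sketch: repartitionAndSortWithinPartitions shuffles records to their target
// partitions and sorts by key during the shuffle, rather than after the fact.
val sc = new SparkContext(
  new SparkConf().setAppName("sorted-partitions").setMaster("local[2]"))

val pairs = sc.parallelize(Seq(("b", 2), ("a", 1), ("c", 3), ("a", 4)))
val sorted = pairs.repartitionAndSortWithinPartitions(new HashPartitioner(2))

// Each partition's iterator now yields records in key order; nothing here
// collects a partition into an in-memory array to sort it.
val perPartitionKeys = sorted.mapPartitions(it => Iterator(it.map(_._1).toList))
```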
On Wed, Jan 28, 2015 at 9:07 AM, Corey Nolet wrote:
In all of the solutions I've found thus far, sorting has been done by casting
the partition iterator into an array and sorting the array. This is not going
to work for my case, as the amount of data in each partition may not
necessarily fit into memory. Any ideas?
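
To make the constraint concrete: once the partition iterator is already sorted (e.g. by the shuffle), it can be consumed as a stream instead of being materialized. A hypothetical helper like the one below groups consecutive records with equal keys while only ever holding one key's values in memory at a time, never the whole partition:

```scala
// Hypothetical sketch: consume an already key-sorted partition as a stream.
// Because equal keys arrive consecutively, grouping needs memory proportional
// to one group, not to the whole partition.
def groupSortedByKey[K, V](it: Iterator[(K, V)]): Iterator[(K, Seq[V])] =
  new Iterator[(K, Seq[V])] {
    private val buffered = it.buffered
    def hasNext: Boolean = buffered.hasNext
    def next(): (K, Seq[V]) = {
      val key = buffered.head._1
      val values = scala.collection.mutable.ArrayBuffer[V]()
      // Drain all consecutive records sharing this key.
      while (buffered.hasNext && buffered.head._1 == key)
        values += buffered.next()._2
      (key, values)
    }
  }
```

This would be run inside mapPartitions on the output of repartitionAndSortWithinPartitions.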
On Wed, Jan 28, 2015 at 1:29 AM, Corey N
I wanted to update this thread for others who may be looking for a solution
to this as well. I found [1] and I'm going to investigate whether it is a
viable solution.
[1]
http://stackoverflow.com/questions/23995040/write-to-multiple-outputs-by-key-spark-one-spark-job
On Wed, Jan 28, 2015 at 12:51 AM,
I need to be able to take an input RDD[Map[String,Any]] and split it into
several different RDDs based on some partitionable piece of the key
(groups) and then send each partition to a separate set of files in
different folders in HDFS.
1) Would running the RDD through a custom partitioner be the
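
The approach in the Stack Overflow link [1] boils down to keying each record by its group and letting Hadoop's MultipleTextOutputFormat route each key to its own folder under the output path. A sketch of that (the "group" field name, output path, and toy data are hypothetical):

```scala
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits on 1.2

// Route each key to its own subdirectory: <outputPath>/<key>/part-NNNNN
class GroupedOutputFormat extends MultipleTextOutputFormat[Any, Any] {
  // Don't write the key into the output records themselves.
  override def generateActualKey(key: Any, value: Any): Any =
    NullWritable.get()
  override def generateFileNameForKeyValue(key: Any, value: Any,
                                           name: String): String =
    key.toString + "/" + name
}

val sc = new SparkContext(
  new SparkConf().setAppName("split-by-group").setMaster("local[2]"))

// Stand-in for the real RDD[Map[String, Any]].
val rdd = sc.parallelize(Seq(
  Map[String, Any]("group" -> "a", "v" -> 1),
  Map[String, Any]("group" -> "b", "v" -> 2)))

val keyed = rdd.map(m => (m("group").toString, m.toString))
keyed.saveAsHadoopFile("/tmp/split-by-group-demo", classOf[String],
  classOf[String], classOf[GroupedOutputFormat])
```

This avoids splitting into separate RDDs entirely - one job writes all the folders.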