Hi Jerry, Thank you for replying to my question. If indeed, spark does not have "secondary sort" by framework, then that is a limitation. There might be cases where you have more values per key that can not be handled in a commodity server's memory (I mean sorting values in RAM).
If we had a partitioner by a natural key (by name), which preserved the order of RDD, then that would be a viable solution: for example, if we sort by (name, time), we would get: (x,1),(1,3) (x,2),(2,9) (x,3),(3,6) (y,1),(1,7) (y,2),(2,5) (y,3),(3,1) (z,1),(1,4) (z,2),(2,8) (z,3),(3,7) (z,4),(4,0) There is a partitioner, but it does not preserve the order of RDD elements. Thanks again, best, Mahmoud ________________________________ From: Shao, Saisai [[email protected]] Sent: Sunday, June 29, 2014 6:41 PM To: [email protected] Subject: RE: Sorting Reduced/Groupd Values without Explicit Sorting Hi Mahmoud, I think you cannot achieve this in current Spark framework, because current Spark’s Shuffle is based on hash, which is different from MapReduce’s sort-based shuffle, so you should implement sorting explicitly using RDD operator. Thanks Jerry From: Parsian, Mahmoud [mailto:[email protected]] Sent: Monday, June 30, 2014 9:00 AM To: [email protected] Subject: Sorting Reduced/Groupd Values without Explicit Sorting Given the following time series data: name, time, value x,2,9 x,1,3 x,3,6 y,2,5 y,1,7 y,3,1 z,3,7 z,4,0 z,1,4 z,2,8 we want to generate the following (the reduced/grouped values are sorted by time). x => [(1,3), (2,9), (3,6)] y => [(1,7), (2,5), (3,1)] z => [(1,4), (2,8), (3,7), (4,0)] One obvious way to sort the value by time is that use Java's collection sort (to sort in memory). How can we achieve sorted values by time WITHOUT explicit sorting in Spark (I mean by using Spark framework)? In Java/MapReduce/Hadoop, we can sort reducer values without explicit sorting: job.setPartitionerClass(MyPartitioner.class); job.setGroupingComparatorClass(MyGroupingComparator.class); The question is how to sort grouped/reduced values without explicit sorting? Thanks, best, Mahmoud
