Yes, the current implementation has the memory limitation, the community 
already noticed this problem and there's a patch to solve this problem 
(PR931<https://github.com/apache/spark/pull/931>), you can click to see the 
details.

Also as you said, current Spark cannot guarantee the order of elements within 
partitions after shuffle, so you have to sort by yourself.

Thanks
Saisai.

From: Parsian, Mahmoud [mailto:[email protected]]
Sent: Monday, June 30, 2014 11:08 AM
To: [email protected]
Subject: RE: Sorting Reduced/Groupd Values without Explicit Sorting

Hi Jerry,

Thank you for replying to my question. If indeed, spark does not have 
"secondary sort" by framework, then that is a limitation. There might be cases 
where you have more values per key that can not be handled in a commodity 
server's memory (I mean sorting values in RAM).

If we had a partitioner by a natural key (by name), which preserved the order 
of RDD, then that would be a viable solution: for example, if we sort by (name, 
time), we would get:

(x,1),(1,3)
(x,2),(2,9)
(x,3),(3,6)
(y,1),(1,7)
(y,2),(2,5)
(y,3),(3,1)
(z,1),(1,4)
(z,2),(2,8)
(z,3),(3,7)
(z,4),(4,0)

There is a partitioner, but it does not preserve the order of RDD elements.

Thanks again,
best,
Mahmoud

________________________________
From: Shao, Saisai [[email protected]]
Sent: Sunday, June 29, 2014 6:41 PM
To: [email protected]<mailto:[email protected]>
Subject: RE: Sorting Reduced/Groupd Values without Explicit Sorting
Hi Mahmoud,

I think you cannot achieve this in current Spark framework, because current 
Spark's Shuffle is based on hash, which is different from MapReduce's 
sort-based shuffle, so you should implement sorting explicitly using RDD 
operator.

Thanks
Jerry

From: Parsian, Mahmoud [mailto:[email protected]]
Sent: Monday, June 30, 2014 9:00 AM
To: [email protected]<mailto:[email protected]>
Subject: Sorting Reduced/Groupd Values without Explicit Sorting

Given the following time series data:

name, time, value
x,2,9
x,1,3
x,3,6
y,2,5
y,1,7
y,3,1
z,3,7
z,4,0
z,1,4
z,2,8

we want to generate the following (the reduced/grouped values are sorted by 
time).

x => [(1,3), (2,9), (3,6)]
y => [(1,7), (2,5), (3,1)]
z => [(1,4), (2,8), (3,7), (4,0)]

One obvious way to sort the value by time is that use Java's collection sort 
(to sort in memory).

How can we achieve sorted values by time WITHOUT explicit sorting in Spark (I 
mean by using Spark framework)?

In Java/MapReduce/Hadoop, we can sort reducer values without explicit sorting:
        job.setPartitionerClass(MyPartitioner.class);
        job.setGroupingComparatorClass(MyGroupingComparator.class);

The question is how to sort grouped/reduced values without explicit sorting?

Thanks,
best,
Mahmoud






Reply via email to