Hey Adrian,

Thanks for your fast reply. :)
Actually, the "pre-condition" is not fixed in the real application; e.g. it would change based on a count of the previously unmatched elements. So I need to use an iterator-style operator rather than flatMap-like operators…

Besides, do you have any idea how to avoid that "sort again"? It is too costly… :(

Anyway, thank you again!

Best,
Yifan LI

> On 12 Oct 2015, at 12:19, Adrian Tanase <atan...@adobe.com> wrote:
>
> I think you're looking for the flatMap (or flatMapValues) operator – you can
> do something like
>
> sortedRdd.flatMapValues( v =>
>   if (v % 2 == 0) {
>     Some(v / 2)
>   } else {
>     None
>   }
> )
>
> Then you need to sort again.
>
> -adrian
>
> From: Yifan LI
> Date: Monday, October 12, 2015 at 1:03 PM
> To: spark users
> Subject: "dynamically" sort a large collection?
>
> Hey,
>
> I need to scan a large "key-value" collection as below:
>
> 1) sort it on an attribute of "value"
> 2) scan it one by one, starting from the element with the largest value
> 2.1) if the current element matches a pre-defined condition, its value will
> be reduced and the element will be inserted back into the collection.
> If not, the current element should be removed from the collection.
>
> In my previous program, step 1) can easily be done in Spark (RDD
> operation), but I am not sure how to do step 2.1), esp. the "put/inserted
> back" operation on a sorted RDD.
> I have tried to make a new RDD every time an element had to be
> inserted, but it is very costly due to the re-sorting…
>
> Does anyone have some ideas?
>
> Thanks so much!
>
> ******************
> an example:
>
> the sorted result of the initial collection C (sorted on the numeric value), sortedC:
> (1, (71, "aaa"))
> (2, (60, "bbb"))
> (3, (53.5, "ccc"))
> (4, (48, "ddd"))
> (5, (29, "eee"))
> …
>
> pre-condition: its_value % 2 == 0
> If the pre-condition is matched, the value is reduced by half.
>
> Thus:
>
> #1:
> 71 is not matched, so this element is removed.
> (1, (71, "aaa")) --> removed!
> (2, (60, "bbb"))
> (3, (53.5, "ccc"))
> (4, (48, "ddd"))
> (5, (29, "eee"))
> …
>
> #2:
> 60 is matched! 60/2 = 30, so the collection should now be:
> (3, (53.5, "ccc"))
> (4, (48, "ddd"))
> (2, (30, "bbb")) <-- inserted back here
> (5, (29, "eee"))
> …
>
> Best,
> Yifan LI
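
For illustration only, the scan described in the quoted example can be sketched on a local (non-Spark) collection with a max-heap, so that putting an element back costs O(log n) instead of a full re-sort. This is a rough sketch under assumptions, not a Spark solution: it assumes the data fits in the memory of one JVM, and it uses the fixed pre-condition from the example rather than a dynamic one. The names HeapScan, scan and maxSteps are hypothetical; maxSteps is only a guard, since a value of 0 would match the pre-condition forever.

```scala
import scala.collection.mutable

object HeapScan {
  // Element shape follows the example in the thread: (id, (value, label)).
  type Elem = (Int, (Double, String))

  // Order elements by their numeric value, so the heap yields the largest first.
  private val byValue: Ordering[Elem] = new Ordering[Elem] {
    def compare(a: Elem, b: Elem): Int =
      java.lang.Double.compare(a._2._1, b._2._1)
  }

  // Hypothetical helper: returns the elements in the order they are visited.
  def scan(initial: Seq[Elem], maxSteps: Int): Vector[Elem] = {
    val heap = new mutable.PriorityQueue[Elem]()(byValue)
    heap ++= initial
    val visited = Vector.newBuilder[Elem]
    var steps = 0
    while (heap.nonEmpty && steps < maxSteps) {
      val (id, (value, label)) = heap.dequeue() // current largest value
      visited += ((id, (value, label)))
      if (value % 2 == 0) {
        // Pre-condition matched: halve the value and insert the element back.
        heap.enqueue((id, (value / 2, label)))
      }
      // Otherwise the element is simply dropped.
      steps += 1
    }
    visited.result()
  }
}
```

With the five sample elements from the example, HeapScan.scan visits 71 first (removed), then 60, which goes back into the heap as 30 and is visited again after 53.5 and 48.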