Hey,

I need to scan a large key-value collection as follows:

1) sort it on an attribute of the "value"
2) scan it one by one, starting from the element with the largest value
2.1) if the current element matches a pre-defined condition, its value is reduced and the element is inserted back into the collection;
if not, the current element is removed from the collection (see the sketch just below).
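
To make the intended behaviour concrete, here is a minimal single-machine sketch of the scan (the sample data, the condition and the halving are just the ones from the example at the end of this mail; names like ScanSketch are arbitrary):

import scala.collection.mutable

object ScanSketch {
  type Element = (Int, (Double, String))

  def main(args: Array[String]): Unit = {
    // hypothetical sample data, shaped like the example at the end of this mail
    val initial = Seq[Element](
      (1, (71.0, "aaa")), (2, (60.0, "bbb")), (3, (53.5, "ccc")),
      (4, (48.0, "ddd")), (5, (29.0, "eee")))

    // a PriorityQueue dequeues the maximum of its ordering, so ordering by the
    // numeric value always pops the element with the largest value first
    val byValue: Ordering[Element] = Ordering.by(_._2._1)
    val queue = mutable.PriorityQueue[Element](initial: _*)(byValue)

    while (queue.nonEmpty) {
      val (key, (value, payload)) = queue.dequeue()   // element with the largest value
      if (value % 2 == 0) {                           // pre-defined condition (from the example)
        queue.enqueue((key, (value / 2, payload)))    // reduce by half and insert back
        println(s"($key, ${value / 2}) inserted back")
      } else {
        println(s"($key, $value) removed")            // condition not met: remove it
      }
    }
  }
}

This is trivial on one machine; my problem is doing the same thing on a large RDD.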


In my previous program, step 1) was easy to do in Spark (as an RDD
operation), but I am not sure how to do step 2.1), especially the "insert back"
operation on a sorted RDD.
I have tried creating a new RDD every time an element needs to be inserted back
(roughly as in the sketch below), but it is very costly due to the re-sorting…
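
For reference, that attempt looks roughly like the following (a simplified local sketch, again using the example data and condition from the end of this mail): the whole RDD gets re-sorted on every single iteration just to find the next head, which is where the cost comes from.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object CostlyScan {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("costly-scan").setMaster("local[*]"))

    // hypothetical initial data, shaped like the example below
    var current: RDD[(Int, (Double, String))] = sc.parallelize(Seq(
      (1, (71.0, "aaa")), (2, (60.0, "bbb")), (3, (53.5, "ccc")),
      (4, (48.0, "ddd")), (5, (29.0, "eee"))))

    while (!current.isEmpty()) {
      // re-sort the whole RDD just to find the current maximum -- the costly part
      val (key, (value, payload)) =
        current.sortBy(_._2._1, ascending = false).first()

      current = current.filter(_._1 != key)           // remove the head element
      if (value % 2 == 0)                             // pre-defined condition (from the example)
        current = current.union(                      // "insert back" = build a new RDD
          sc.parallelize(Seq((key, (value / 2, payload)))))
    }

    sc.stop()
  }
}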


Does anyone have any ideas?

Thanks so much!

******************
an example:

the result of sorting the initial collection C on the numeric part of its value, sortedC:
(1, (71, "aaa"))
(2, (60, "bbb"))
(3, (53.5, "ccc"))
(4, (48, "ddd"))
(5, (29, "eee"))
…

pre-condition: its_value % 2 == 0
if the pre-condition is matched, the value is reduced by half.

Thus:

#1:
71 does not match the condition, so this element is removed.
(1, (71, "aaa")) --> removed!
(2, (60, "bbb"))
(3, (53.5, "ccc"))
(4, (48, "ddd"))
(5, (29, "eee"))
…

#2:
60 matches! 60/2 = 30, so the collection should now be:
(3, (53.5, "ccc"))
(4, (48, "ddd"))
(2, (30, "bbb")) <-- inserted back here
(5, (29, "eee"))
…

Best,
Yifan LI