Given billions of integers in a file, with memory constraints that force us to parallelize: one solution is to split the file across 100 machines, where each machine computes the median of its part using quickselect and sends the result to a host machine. The host then calculates the median of these 100 medians and asks the sub-machines to discard all numbers less than this final median. So, map-reduce!
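The scheme above can be simulated in one process, with lists standing in for the 100 machines (a hypothetical sketch; the data, chunking, and `median` helper are my assumptions, not from the post). One caveat worth noting: the median of medians only approximates the global median, so discarding everything below it can discard the true median; it is safer as a pivot for further rounds than as the final answer.

```python
import random

def median(nums):
    # Exact median of one machine's chunk; a real machine would use
    # quickselect for O(n) average time instead of sorting.
    s = sorted(nums)
    return s[len(s) // 2]

random.seed(0)

# Hypothetical simulation: each entry of `machines` stands in for the
# chunk of roughly a billion/100 integers held by one machine.
data = [random.randrange(10**6) for _ in range(10_000)]
machines = [data[i::100] for i in range(100)]

# Step 1: each machine computes the median of its own part
# and sends it to the host.
local_medians = [median(chunk) for chunk in machines]

# Step 2: the host computes the median of those 100 medians,
# which it would broadcast back as the value to partition on.
pivot = median(local_medians)
```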
Now I want to do something like this: each sub-machine holds about a billion/100 integers. It picks an integer k at random and partitions its integers into two sets, one with the elements less than k and one with the elements greater than k (as in the quickselect algorithm). The machine then reports three things to the host: k, the size of the "less than" set, and the size of the "greater than" set. Once the host has this data from every sub-machine, it does some calculation and tells each machine to discard either its "less than" set or its "greater than" set, and then the whole process is repeated on the smaller remaining sets. Any ideas about how the host machine would make that decision, and on what basis? -- You received this message because you are subscribed to the Google Groups "Algorithm Geeks" group.
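One plausible answer (my sketch, not from the post) simplifies the protocol slightly: instead of a different pivot per machine, the host broadcasts a single shared pivot each round, every machine partitions its chunk around it and reports the two set sizes, and the host keeps only the side that contains the element of the target rank, adjusting the rank whenever the lower side is dropped. This is essentially distributed quickselect; the function name and the simulation setup are assumptions for illustration.

```python
import random

def distributed_select(machines, rank):
    # Host-side loop: find the element of 0-indexed `rank` across all
    # machines' data. Per-machine work is simulated in this one process;
    # in reality each machine would report only its two set sizes.
    while True:
        alive = [m for m in machines if m]
        # Host samples a pivot from a random machine's remaining data
        # and broadcasts it.
        pivot = random.choice(random.choice(alive))
        # Each machine partitions its chunk around the shared pivot.
        less = [[x for x in m if x < pivot] for m in machines]
        greater = [[x for x in m if x > pivot] for m in machines]
        n_less = sum(len(p) for p in less)
        n_greater = sum(len(p) for p in greater)
        n_equal = sum(len(m) for m in machines) - n_less - n_greater
        if rank < n_less:
            machines = less        # host says: discard the "greater" sets
        elif rank < n_less + n_equal:
            return pivot           # the pivot itself has the target rank
        else:
            machines = greater     # host says: discard the "less" sets
            rank -= n_less + n_equal

# Example: 1001 distinct integers spread over 10 "machines";
# the element of rank 500 is the median.
random.seed(42)
data = list(range(1001))
random.shuffle(data)
chunks = [data[i::10] for i in range(10)]
print(distributed_select(chunks, len(data) // 2))  # prints 500
```

Each round strictly shrinks the total (the pivot is always excluded from both kept sides), so the loop terminates, and only O(1) numbers travel between host and machines per round.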