Hi,

Let's say the smaller collection is named A. It is relatively small, with fewer than 100,000 entries (it could also be only 100), and its values carry almost no payload. Collection B is a big collection with more than 10,000,000 entries, where the value for each key is relatively big (>100 KB). Each key of A also exists in collection B.
For all the keys in A, I need to get the corresponding value from B and collect it in the output.

I can do this by reading in both files and, in the reduce step, doing my computations and collecting only those keys which appear in both A and B. The map phase, however, will take very long, because all the key/value pairs of collection B have to be sorted at the end of the map phase (and each key's value is >100 KB), which is overkill if A is very small.

What I would need is a way to compute the intersection first: a mapper that emits only keys, then a reduce function that operates only on the keys (not the corresponding values) and collects the keys I want to keep, and then a second map pass over B that filters the output collector (or the input) based on the results of that reduce phase. Or is there another, faster way?

Collection A could be so big that it doesn't fit into memory. I could split collection A up into multiple smaller collections, but that would make things more complicated, so I want to avoid that route. (It is essentially a manual version of the approach I described above.)

Thanks,
Thibaut
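P.S. To make the second step concrete, here is roughly what I have in mind for the filtering pass over B, as a map-only job (reducers set to 0, so B's large values never go through sort/shuffle). This is only a sketch: the Text/BytesWritable types, the class name, and the way the key list from the first job gets loaded are just placeholders, not working code.

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class FilterBMapper extends MapReduceBase
    implements Mapper<Text, BytesWritable, Text, BytesWritable> {

  // Keys of A, loaded once per map task. The keys alone should be
  // small enough for memory even when A's full records are not.
  private final Set<String> wantedKeys = new HashSet<String>();

  public void configure(JobConf job) {
    // Load the key list produced by the first (keys-only) job into
    // wantedKeys, e.g. from a file shipped via DistributedCache.
    // Omitted here.
  }

  public void map(Text key, BytesWritable value,
      OutputCollector<Text, BytesWritable> output, Reporter reporter)
      throws IOException {
    // Only the few keys in A pass through; everything else is
    // dropped right here, so the big values are never shuffled.
    if (wantedKeys.contains(key.toString())) {
      output.collect(key, value);
    }
  }
}

If the key set of A really does outgrow memory at some point, this mapper would have to be replaced by something else, which is exactly the part I'm unsure about.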