Hi,

Let's say the smaller collection is named A. It is relatively small, with fewer than 100,000 entries (it could also be only 100), and its values carry almost no payload. Collection B is a big collection with more than 10,000,000 entries, where the value for each key is relatively big (>100 KB). Each key of A also exists in collection B.
For all the keys in A, I need to get the corresponding value from B and collect it in the output.

I can do this by reading in both files and, in the reduce step, doing my computations and collecting only those keys which appear in both A and B. The map phase, however, will take very long, because all the key/value pairs of collection B have to be sorted at the end of the map phase (and each key's value is >100 KB), which is overkill if A is very small.

What I would need is a way to compute the intersection first: a mapper that emits only keys, then a reduce function that operates only on the keys (not the corresponding values) and collects the keys I want to keep, and then a second map pass over B that filters the output collector (or the input) based on the results of that reduce phase. Or is there another, faster way?

Collection A could be so big that it doesn't fit into memory. I could split collection A up into multiple smaller collections, but that would make things more complicated, so I want to avoid that route. (It is essentially a manual version of the approach I described above.)

Thanks,
Thibaut
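P.S. To make the second step concrete, here is roughly what I have in mind for the filtering pass over B, as a map-only job (reducers set to 0, so B's large values never go through sort/shuffle). This is only a sketch: the Text/BytesWritable types, the class name, and the way the key list from the first job gets loaded are just placeholders, not working code.

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class FilterBMapper extends MapReduceBase
    implements Mapper<Text, BytesWritable, Text, BytesWritable> {

  // Keys of A, loaded once per map task. The keys alone should be
  // small enough for memory even when A's full records are not.
  private final Set<String> wantedKeys = new HashSet<String>();

  public void configure(JobConf job) {
    // Load the key list produced by the first (keys-only) job into
    // wantedKeys, e.g. from a file shipped via DistributedCache.
    // Omitted here.
  }

  public void map(Text key, BytesWritable value,
      OutputCollector<Text, BytesWritable> output, Reporter reporter)
      throws IOException {
    // Only the few keys in A pass through; everything else is
    // dropped right here, so the big values are never shuffled.
    if (wantedKeys.contains(key.toString())) {
      output.collect(key, value);
    }
  }
}

If the key set of A really does outgrow memory at some point, this mapper would have to be replaced by something else, which is exactly the part I'm unsure about.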