Are the keys in collection B unique?

If so, I would like to try this approach:
For each <key, value> pair in collection B, write a file whose name is the
MD5 hash of the key and whose content is the value, then store all of these
files in a HAR (Hadoop Archive).
The HAR archive builds an index over the file names (i.e., over the hashed
keys) for you. You can then iterate over collection A, compute the MD5 hash
of each key, and look up the corresponding file in the archive to retrieve
the value.
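A minimal sketch of the key-to-file-name step (the class and method names are illustrative, and the archive itself would be built separately with the `hadoop archive` tool):

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class KeyHasher {
    // MD5 hex digest of a key; used as the per-key file name inside the archive.
    static String md5Hex(String key) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(key.getBytes(StandardCharsets.UTF_8));
            // Interpret the 16 digest bytes as a positive integer and
            // left-pad the hex form to the full 32 characters.
            StringBuilder hex = new StringBuilder(new BigInteger(1, digest).toString(16));
            while (hex.length() < 32) hex.insert(0, '0');
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e); // MD5 is always available in the JDK
        }
    }

    public static void main(String[] args) {
        // MD5("abc") is the well-known RFC 1321 test vector.
        System.out.println(md5Hex("abc")); // 900150983cd24fb0d6963f7d28e17f72
    }
}
```

Hashing both sides the same way means the lookup for a key from A is a single file-name computation plus one indexed read from the archive, with no sort over B's large values.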

On Wed, Feb 11, 2009 at 4:39 PM, Thibaut_ <tbr...@blue.lu> wrote:

>
> Hi,
>
> Let's say the smaller collection is named A. It is relatively small
> (< 100,000 entries, could also be only 100), with nearly no payload as the
> value. Collection B is a big collection with > 10,000,000 entries (each key
> of A also exists in B), where the value for each key is relatively big
> (> 100 KB).
>
> For all the keys in A, I need to get the corresponding value from B and
> collect it in the output.
>
>
> - I can do this by reading in both files and, in the reduce step, doing my
> computations and collecting only those keys which are in both A and B. The
> map phase, however, will take very long, because all the key/value pairs of
> collection B need to be sorted (and each key's value is > 100 KB) at the
> end of the map phase, which is overkill if A is very small.
>
> What I would need is an option to somehow compute the intersection first
> (a mapper only on the keys, then a reduce function based only on the keys
> and not the corresponding values, which collects the keys I want to keep),
> and then running the main map job and filtering the output collector or the
> input based on the results from that reduce phase.
>
> Or is there another, faster way? Collection A could be so big that it
> doesn't fit into memory. I could split collection A up into multiple
> smaller collections, but that would make things more complicated, so I want
> to avoid that route. (This is similar to the approach I described above,
> just done manually.)
>
> Thanks,
> Thibaut
> --
> View this message in context:
> http://www.nabble.com/Finding-small-subset-in-very-large-dataset-tp21964853p21964853.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>
