Bloom Filters are one of the greatest things ever, so it is nice to see another application.
Remember that your filter may make mistakes: it can report items as present that are not in the set (false positives), though it never misses an item that is in the set. Also, instead of setting a single bit per item (in the A set), set k distinct bits, one per hash function. You can work out analytically the best k for a given number of items and a given amount of memory. In practice, this usually boils down to k being 3 or so for a reasonable error rate.

Happy hunting

Miles

2009/2/12 Thibaut_ <tbr...@blue.lu>:
>
> Thanks,
>
> I didn't think about the bloom filter variant. That's the solution I was
> looking for :-)
>
> Thibaut
> --
> View this message in context:
> http://www.nabble.com/Finding-small-subset-in-very-large-dataset-tp21964853p21977132.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>

--
The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
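
P.S. For anyone finding this thread in the archive, the advice above about setting k bits per item and choosing k analytically can be sketched roughly as follows. This is only a minimal illustration in Python, not the code anyone in the thread used: the names `optimal_k`, `false_positive_rate` and `BloomFilter` are made up for the example, the formulas are the standard textbook approximations, and deriving the k positions from one SHA-256 digest by double hashing is an assumption of mine (the positions can occasionally coincide, so "distinct" is only approximately true here).

```python
import hashlib
import math


def optimal_k(m_bits, n_items):
    """Textbook optimum for the number of bits set per item: (m / n) * ln 2."""
    return max(1, round(m_bits / n_items * math.log(2)))


def false_positive_rate(m_bits, n_items, k):
    """Standard approximation of the error rate: (1 - e^(-k*n/m))^k."""
    return (1.0 - math.exp(-k * n_items / m_bits)) ** k


class BloomFilter:
    """Hypothetical minimal filter: m bits of memory, k bits set per item."""

    def __init__(self, m_bits, k):
        self.m = m_bits
        self.k = k
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: two 64-bit values from one digest yield k positions.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))


if __name__ == "__main__":
    small_set_a = ["key-%d" % i for i in range(1000)]  # stand-in for the A set
    m = 10 * len(small_set_a)                          # assume 10 bits per item
    k = optimal_k(m, len(small_set_a))
    bloom = BloomFilter(m, k)
    for key in small_set_a:
        bloom.add(key)
    print("k =", k, "estimated error rate =", false_positive_rate(m, len(small_set_a), k))
    print("key-42" in bloom)
```

Because the filter can return false positives, anything it lets through from the large dataset should still be checked against the exact A set afterwards; the filter only serves to discard most non-members cheaply.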