Bloom Filters are one of the greatest things ever, so it is nice to
see another application.

Remember that your filter may make mistakes: it can report false
positives, so you will see some items that are not actually in the set
(it never misses an item that is).  Also, instead of setting a single
bit per item (in the A set), set k distinct bits.
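
As a rough illustration of the k-bits-per-item idea, here is a minimal
Python sketch.  The class name, the use of SHA-256, and the double-hashing
trick for deriving the k positions are my own assumptions for the example,
not anything specific to Hadoop; the k positions are also not guaranteed
to be distinct for every item.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter sketch: m bits, k probe positions per item."""

    def __init__(self, m_bits, k):
        self.m = m_bits
        self.k = k
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, item):
        # Derive k positions from one digest via double hashing
        # (an assumed scheme, chosen to avoid needing k hash functions).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # keep the step non-zero
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        # Set every one of the item's k bits.
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        # "Maybe present" only if all k bits are set; otherwise definitely absent.
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))
```

You would add every key of the small set A, then test each record of the
large set with `key in bloom` and run the exact check only on the records
that pass.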

You can work out analytically the best k for a given number of items
and a given amount of memory.  In practice, this usually boils down
to k being 3 or so for a reasonable error rate.
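
For reference, the standard textbook result is k = (m/n) ln 2 for m bits
and n items, with a false-positive rate of roughly (1 - e^(-kn/m))^k.
A small sketch of that arithmetic follows; the function names are just
made up for the example.

```python
import math


def optimal_k(m_bits, n_items):
    """Number of bits to set per item that minimises false positives."""
    return max(1, round((m_bits / n_items) * math.log(2)))


def false_positive_rate(m_bits, n_items, k):
    """Approximate false-positive probability after inserting n_items."""
    return (1.0 - math.exp(-k * n_items / m_bits)) ** k
```

By this formula, k = 3 is the best choice when you have roughly four bits
of memory per item; with more memory per item a larger k pays off.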

Happy hunting

Miles

2009/2/12 Thibaut_ <tbr...@blue.lu>:
>
> Thanks,
>
> I didn't think about the bloom filter variant. That's the solution I was
> looking for :-)
>
> Thibaut
> --
> View this message in context: 
> http://www.nabble.com/Finding-small-subset-in-very-large-dataset-tp21964853p21977132.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>



-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
