markharw00d wrote:
Andrzej Bialecki wrote:

Funny, I was having vague thoughts about this today too having been concerned about some of the big arrays that can end up in a typical Lucene app. Aside from providing space-efiicient lookups, another application for BloomFilters is in similarity measures e.g. ANDing 2 BloomFilters can provide the basis for a rough measure of similarity.

Yes, it's an intriguing thought - worth checking how well it works in practice.

There are some other cool things you can do with BloomFilter-s:

* a counting BloomFilter allows both adding and deleting the keys, and the estimation of frequency of a key (if the same key is added multiple times).

* spectral BloomFilters provide a way to store a (small) exact value within a filter, so that when you test the membership you can also retrieve the value corresponding to that key (perhaps good for idf?). This could be potentially useful in many places, but one specific application is outlined here: http://markmail.org/message/xyqdz3go6jwu4ozm

* if you use BloomFilter just for quick lookup, but you want to eliminate false positives, it's possible to probe the main filter using all keys from the keyspace, and build a small table of exceptions that eliminates false positives from the first filter.

* dynamic BloomFilter-s don't require that you know the number of keys in advance. This could be useful e.g. to quickly eliminate nonexistent terms from a query, without actually looking up the query terms in the dictionary - just maintain a dynamic BloomFilter of all terms in the index.

* another thought: we could create a BloomFilter from all terms in a document, and store it as a field in a document. I don't know yet what for ... :) well, are there any scenarios when you test the document whether it contains a list of words? perhaps spam/porn detection.

Keep the ideas coming ... I have a feeling that eventually something useful will come out of it.

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org

Reply via email to