Drew - check out Hadoop; I believe there are a few Bloom filter implementations there (quick sketch below the sig).

Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
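A minimal sketch of the Hadoop one (org.apache.hadoop.util.bloom.BloomFilter); the sizes and the sample ngram below are illustrative, not recommendations:

import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;
import org.apache.hadoop.util.hash.Hash;

public class BloomDemo {
  public static void main(String[] args) throws Exception {
    // vectorSize and nbHash are illustrative; size them for your
    // expected ngram count and tolerable false-positive rate.
    BloomFilter filter = new BloomFilter(1 << 20, 5, Hash.MURMUR_HASH);

    filter.add(new Key("information retrieval".getBytes("UTF-8")));

    // May print a false positive, never a false negative.
    System.out.println(
        filter.membershipTest(new Key("information retrieval".getBytes("UTF-8"))));

    // BloomFilter is a Writable, so it can be serialized and saved
    // as Jake describes below: filter.write(out) / filter.readFields(in).
  }
}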
----- Original Message ----
> From: Drew Farris <[email protected]>
> To: [email protected]
> Sent: Wed, January 6, 2010 10:23:52 PM
> Subject: Re: n-grams for terms?
>
> Jake,
>
> Thanks for mentioning this approach. The
> ShingleFilter/ShingleAnalyzerWrapper is pretty handy and I'd never
> used it before.
>
> Is there a bloom filter implementation somewhere in Mahout or
> elsewhere in the lucene ecosystem?
>
> Drew
>
> On Wed, Jan 6, 2010 at 8:41 PM, Jake Mannix wrote:
>
> > The way I've done this is to take whatever unigram analyzer for
> > tokenization fits what you want to do, wrap it in Lucene's
> > ShingleAnalyzer, and use that as the "tokenizer" (which now produces
> > ngram tokens as single tokens each), run that through the LLR ngram
> > M/R job (which ends by sorting descending by LLR score), and shove
> > the top-K ngrams (and sometimes the unigrams which fit some "good"
> > IDF range) into a big bloom filter, which is serialized and saved.
> >
> > With that, you can take the original ShingleAnalyzer you used
> > previously, and to produce vectors, you take the ngram token stream
> > output and check each emitted token to see if it is in the bloom
> > filter; if not, discard it. If it is, you can hash it (or
> > multiply-hash it) to get the ngram id for that token. Of course,
> > that doesn't properly normalize the columns of your term-document
> > matrix (you don't have your IDF factors), but you can do that as a
> > post-processing step after this one.
> >
> > -jake
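For anyone following along, here's the shingle step from Jake's recipe as a minimal sketch against the Lucene 2.9-era contrib API (the field name and sample text are arbitrary):

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.shingle.ShingleAnalyzerWrapper;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class ShingleDemo {
  public static void main(String[] args) throws Exception {
    // Wrap a plain unigram analyzer so it emits ngrams as single
    // tokens, per Jake's description. maxShingleSize=2 -> bigrams.
    Analyzer shingles =
        new ShingleAnalyzerWrapper(new WhitespaceAnalyzer(), 2);

    TokenStream ts = shingles.tokenStream(
        "text", new StringReader("please divide this sentence"));
    TermAttribute term = ts.addAttribute(TermAttribute.class);
    while (ts.incrementToken()) {
      // Prints: please, "please divide", divide, "divide this", ...
      System.out.println(term.term());
    }
    ts.close();
  }
}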

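And a rough sketch of the vectorization pass Jake outlines: test each emitted shingle against the saved filter, discard misses, and hash survivors down to a column id. The helper class, dimension, and single-hash scheme here are illustrative stand-ins, not Mahout's actual LLR job:

import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;

public class NgramIds {
  // Illustrative vector dimension; not what Mahout's job uses.
  private static final int DIMENSION = 1 << 18;

  /** Returns a column id for the ngram, or -1 to discard it. */
  public static int idOf(BloomFilter goodNgrams, String ngram)
      throws Exception {
    if (!goodNgrams.membershipTest(new Key(ngram.getBytes("UTF-8")))) {
      return -1; // not one of the top-K ngrams: discard
    }
    // One simple hash to a column index; hashing with a second,
    // independent function ("multiply hashing") cuts collisions.
    return (ngram.hashCode() & Integer.MAX_VALUE) % DIMENSION;
  }
}

IDF renormalization would then run over the resulting vectors as the separate post-processing step he mentions.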