Jake,

Thanks for mentioning this approach. The ShingleFilter/ShingleAnalyzerWrapper is pretty handy and I'd never used it before.
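For anyone else seeing it for the first time, the wrapping looks roughly like this (a minimal sketch, assuming a Lucene 3.0-era API; the attribute classes and constructors have moved around between versions):

    import java.io.StringReader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceAnalyzer;
    import org.apache.lucene.analysis.shingle.ShingleAnalyzerWrapper;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;

    public class ShingleDemo {
      public static void main(String[] args) throws Exception {
        // Wrap any unigram analyzer; here WhitespaceAnalyzer, emitting
        // shingles up to trigrams alongside the unigrams themselves.
        Analyzer shingles =
            new ShingleAnalyzerWrapper(new WhitespaceAnalyzer(), 3);

        TokenStream ts =
            shingles.tokenStream("text", new StringReader("the quick brown fox"));
        TermAttribute term = ts.addAttribute(TermAttribute.class);
        while (ts.incrementToken()) {
          // Prints the unigrams plus shingles like "quick brown" and
          // "quick brown fox", each as a single token.
          System.out.println(term.term());
        }
        ts.close();
      }
    }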
Is there a bloom filter implementation somewhere in Mahout or elsewhere in the Lucene ecosystem?

Drew

On Wed, Jan 6, 2010 at 8:41 PM, Jake Mannix <[email protected]> wrote:
> The way I've done this is to take whatever unigram analyzer for tokenization
> fits what you want to do, wrap it in Lucene's ShingleAnalyzer, and use that
> as the "tokenizer" (which now produces ngram tokens as single tokens each),
> and run that through the LLR ngram M/R job (which ends by sorting descending
> by LLR score), and shove the top-K ngrams (and sometimes the unigrams which
> fit some "good" IDF range) into a big bloom filter, which is serialized and
> saved.
>
> With that, you can take that original ShingleAnalyzer you used previously,
> and to produce vectors, you take the ngram token stream output and check
> each emitted token to see if it is in the bloom filter; if not, discard. If
> it is, you can hash (or multiply-hash) it to get the ngram id for that
> token. Of course, that doesn't properly normalize the columns of your
> term-document matrix (you don't have your IDF factors), but you can do that
> as a post-processing step after this one.
>
> -jake
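To check my understanding of the vectorization step, is it something like the below? (A sketch only: I'm using Hadoop's org.apache.hadoop.util.bloom.BloomFilter as a stand-in since I know that one exists, and the filter sizing and the plain modulo hashing are placeholder choices of mine, not necessarily what Jake does.)

    import org.apache.hadoop.util.bloom.BloomFilter;
    import org.apache.hadoop.util.bloom.Key;
    import org.apache.hadoop.util.hash.Hash;

    public class NgramFeatures {
      public static void main(String[] args) {
        // Build the filter from the top-K ngrams that survived the LLR job.
        // Sizing (bit-vector length, hash count) is a placeholder; tune it
        // for your K and acceptable false-positive rate. BloomFilter is
        // Writable, so it can be serialized at the end of the M/R job and
        // reloaded here, as Jake describes.
        BloomFilter goodNgrams =
            new BloomFilter(8 * 1024 * 1024, 5, Hash.MURMUR_HASH);
        goodNgrams.add(new Key("quick brown".getBytes()));

        int numFeatures = 1 << 18;  // dimensionality of the hashed feature space

        // Per-token step during vectorization: discard tokens the filter
        // doesn't contain; hash survivors down to a column index.
        String token = "quick brown";
        if (goodNgrams.membershipTest(new Key(token.getBytes()))) {
          int id = (token.hashCode() & Integer.MAX_VALUE) % numFeatures;
          System.out.println(token + " -> feature " + id);
        } // else: not a top-K ngram, skip it
      }
    }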
