Hadoop has a robust Bloom filter implementation with an easy-to-use API.
There are others in open source land, but they tend to be lacking in
either features or performance (for example, and sorry to be vague here,
one of the open source Facebook projects).
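
For reference, a minimal sketch of the Hadoop API
(org.apache.hadoop.util.bloom); the vector size and hash count below are
illustrative, not tuned recommendations:

import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;
import org.apache.hadoop.util.hash.Hash;

public class BloomSketch {
  public static void main(String[] args) throws Exception {
    // vectorSize and nbHash are illustrative; size them to the expected
    // number of keys and the false-positive rate you can tolerate
    BloomFilter filter = new BloomFilter(1 << 20, 5, Hash.MURMUR_HASH);
    filter.add(new Key("new york".getBytes("UTF-8")));
    // membership tests can return false positives, never false negatives
    System.out.println(
        filter.membershipTest(new Key("new york".getBytes("UTF-8"))));
    // BloomFilter implements Writable, so write(DataOutput) /
    // readFields(DataInput) handle serialization
  }
}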

On Wed, Jan 6, 2010 at 7:34 PM, Otis Gospodnetic
<[email protected]> wrote:
> Drew - check out Hadoop, I believe there are a few Bloom filter 
> implementations there.
>  Otis
> --
> Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
>
>
>
> ----- Original Message ----
>> From: Drew Farris <[email protected]>
>> To: [email protected]
>> Sent: Wed, January 6, 2010 10:23:52 PM
>> Subject: Re: n-grams for terms?
>>
>> Jake,
>>
>> Thanks for mentioning this approach. The
>> ShingleFilter/ShingleAnalyzerWrapper is pretty handy and I'd never
>> used it before.
>>
>> Is there a bloom filter implementation somewhere in Mahout or
>> elsewhere in the lucene ecosystem?
>>
>> Drew
>>
>> On Wed, Jan 6, 2010 at 8:41 PM, Jake Mannix wrote:
>>
>> > The way I've done this is to take whatever unigram analyzer fits the
>> > tokenization you want, wrap it in Lucene's ShingleAnalyzerWrapper, and
>> > use that as the "tokenizer" (which now produces ngrams as single
>> > tokens), then run that through the LLR ngram M/R job (which ends by
>> > sorting descending by LLR score), and shove the top-K ngrams (and
>> > sometimes the unigrams which fall in some "good" IDF range) into a big
>> > bloom filter, which is serialized and saved.
>> >
>> > With that, to produce vectors you take the original shingle analyzer
>> > you used previously, run documents through it, and check each emitted
>> > token to see whether it is in the bloom filter; if not, discard it.
>> > If it is, you can hash (or multiply-hash) it to get the ngram id for
>> > that token.  Of course, that doesn't properly normalize the columns of
>> > your term-document matrix (you don't have your IDF factors), but you
>> > can do that as a post-processing step after this one.
>> >
>> >  -jake
>
>
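
For concreteness, here is a rough sketch of the vectorization step Jake
describes above, assuming a recent Lucene (ShingleAnalyzerWrapper,
StandardAnalyzer) plus the Hadoop BloomFilter; the file path, field name,
feature-space size, and single-hash id scheme are all placeholders, not
the exact job Jake ran:

import java.io.DataInputStream;
import java.io.FileInputStream;
import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.shingle.ShingleAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class NgramVectorSketch {
  static final int NUM_FEATURES = 1 << 18; // illustrative feature-space size

  public static void main(String[] args) throws Exception {
    // deserialize the filter produced by the LLR top-K step
    // ("ngrams.bloom" is a placeholder path)
    BloomFilter bloom = new BloomFilter();
    DataInputStream in =
        new DataInputStream(new FileInputStream("ngrams.bloom"));
    bloom.readFields(in);
    in.close();

    // same shingle-wrapped analyzer used when counting ngrams,
    // so ngrams come out as single tokens
    Analyzer shingles = new ShingleAnalyzerWrapper(new StandardAnalyzer(), 3);

    double[] vector = new double[NUM_FEATURES];
    TokenStream ts = shingles.tokenStream("text", "the quick brown fox");
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      String ngram = term.toString();
      // discard tokens that didn't survive the LLR top-K cut
      if (!bloom.membershipTest(new Key(ngram.getBytes("UTF-8")))) continue;
      // placeholder id scheme: one hash into the feature space;
      // multiply-hashing would reduce collisions
      int id = (ngram.hashCode() & Integer.MAX_VALUE) % NUM_FEATURES;
      vector[id] += 1.0;
    }
    ts.end();
    ts.close();
  }
}

As Jake notes, IDF normalization of the resulting vectors would still be
a separate post-processing pass.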
