Some tools exist for finding duplicated parts in documents.
You split the document into phrases and build n-grams from the words. If you want complete phrases, work with all the words; for partial matches, work with 5-word n-grams, for example. The n-gram list is converted to hashes, and each hash is used as an indexed field value for the document. With this trick, you can work with phrases the same way you are used to working with words, without using too much space.
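
Something like this, as a rough sketch in plain Java (the class name, shingle size, and hash choice are just illustrative; they are not part of any Lucene API):

import java.util.HashMap;
import java.util.Map;

public class ShingleHasher {

    // Build overlapping n-word "shingles" from the text, hash each one,
    // and count how often each hash occurs. Indexing those hashes as an
    // extra field lets you query phrases the same way you query terms,
    // without storing the full phrases themselves.
    public static Map<Long, Integer> shingleHashes(String text, int n) {
        String[] words = text.toLowerCase().split("\\s+");
        Map<Long, Integer> counts = new HashMap<Long, Integer>();
        for (int i = 0; i + n <= words.length; i++) {
            StringBuilder shingle = new StringBuilder();
            for (int j = i; j < i + n; j++) {
                if (j > i) shingle.append(' ');
                shingle.append(words[j]);
            }
            // A simple hash; a stronger one (MD5, for instance) would
            // reduce collisions between different phrases.
            long hash = shingle.toString().hashCode() & 0xffffffffL;
            Integer c = counts.get(hash);
            counts.put(hash, c == null ? 1 : c + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<Long, Integer> counts =
            shingleHashes("the quick brown fox jumps over the lazy dog", 5);
        System.out.println(counts.size() + " distinct 5-word shingles");
    }
}

Each hash could then be added to the document as an indexed, untokenized field value; summing the counts across documents and sorting would give the most frequent phrases.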

I'm not sure I'm being as clear as I'd like.

M.

On 9 Aug 07, at 09:34, Akanksha Baid wrote:

I was wondering if there is a "search based" method to find the top-k
frequent phrases in a set of documents. (I do not have a particular phrase
in mind, so PhraseQuery can probably be ruled out.)
I have implemented something that works using term vectors and term positions,
but the performance is not great so far, since I am basically iterating
multiple times and hacking my way around. I was wondering if an API exists
for finding frequent phrases, and/or if someone could point me to some code
for the same.

Thanks.

