Some tools exist for finding duplicated parts in documents.
You split the document into phrases and build n-grams from the words. If you want complete phrases, work with all the words; for partial matches, work with 5-word n-grams, for example. The n-gram list is converted to hashes, and each hash is used as an indexed field value for the document. With this trick, you can work with phrases the same way you are used to working with words, without using too much space.
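
Something like this, as a rough sketch in plain Java (the class name, shingle size, and hash choice are just illustrative; they are not part of any Lucene API):

import java.util.HashMap;
import java.util.Map;

public class ShingleHasher {

    // Build overlapping n-word "shingles" from the text, hash each one,
    // and count how often each hash occurs. Indexing those hashes as an
    // extra field lets you query phrases the same way you query terms,
    // without storing the full phrases themselves.
    public static Map<Long, Integer> shingleHashes(String text, int n) {
        String[] words = text.toLowerCase().split("\\s+");
        Map<Long, Integer> counts = new HashMap<Long, Integer>();
        for (int i = 0; i + n <= words.length; i++) {
            StringBuilder shingle = new StringBuilder();
            for (int j = i; j < i + n; j++) {
                if (j > i) shingle.append(' ');
                shingle.append(words[j]);
            }
            // A simple hash; a stronger one (MD5, for instance) would
            // reduce collisions between different phrases.
            long hash = shingle.toString().hashCode() & 0xffffffffL;
            Integer c = counts.get(hash);
            counts.put(hash, c == null ? 1 : c + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<Long, Integer> counts =
            shingleHashes("the quick brown fox jumps over the lazy dog", 5);
        System.out.println(counts.size() + " distinct 5-word shingles");
    }
}

Each hash could then be added to the document as an indexed, untokenized field value; summing the counts across documents and sorting would give the most frequent phrases.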

I'm not sure I'm being as clear as I'd like.

M.

On 9 Aug 07, at 09:34, Akanksha Baid wrote:

I was wondering if there is a "search based" method to find the top-k
frequent phrases in a set of documents. (I do not have a particular phrase
in mind, so PhraseQuery can probably be ruled out.)
I have implemented something that works using term vectors and term positions,
but the performance is not great so far, since I am basically iterating
multiple times and hacking my way around. I was wondering if an API exists
for finding frequent phrases, and/or if someone could point me to some code
for the same.

Thanks.

