Re: Phrase Frequency For Analysis

2006-06-22 Thread Bob Carpenter
Adding to this growing thread, there's really no reason to index all the term bigrams, trigrams, etc. It's not only slow, it's very memory/disk intensive. All you need to do is two passes over the collection. Pass One: Collect counts of bigrams (or trigrams, or whatever -- if size is an
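
The counting step of Pass One might be sketched as follows. This is only an illustration, assuming whitespace tokenization and a collection small enough to hold its counts in memory; the function name `count_ngrams` and the sample documents are hypothetical and do not come from the thread. Only the first pass is shown.

```python
from collections import Counter


def count_ngrams(docs, n=2):
    """Count every run of n consecutive tokens across all documents."""
    counts = Counter()
    for text in docs:
        # Assumption: lowercase + whitespace split stands in for a real analyzer.
        tokens = text.lower().split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


# Hypothetical two-document collection.
docs = ["the quick brown fox", "the quick red fox"]
bigram_counts = count_ngrams(docs, n=2)
```

Passing `n=3` would count trigrams in the same way.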

Re: Phrase Frequency For Analysis

2006-06-22 Thread Andrzej Bialecki
Nader Akhnoukh wrote: Yes, Chris is correct, the goal is to determine the most frequently occurring phrases in a document compared to the frequency of that phrase in the index. So there are only output phrases, no inputs. Also performance is not really an issue, this would take place on an irre

Re: Phrase Frequency For Analysis

2006-06-22 Thread Kamal Abou Mikhael
I may be coming into this thread without knowing enough. I have implemented a phrase filter, which indexes all token sequences that are 2 to N tokens long. N is defined in the constructor. It takes a stopword Trie for input because the policy I used, based on a published work I read, was that a
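
As a rough illustration of "all token sequences that are 2 to N tokens long", the enumeration could look like the sketch below. It is not the filter described in the message: the function name `token_sequences` is hypothetical, and stopword handling is left out entirely because the policy is not spelled out here.

```python
def token_sequences(tokens, max_n):
    """Return every contiguous token sequence of length 2..max_n, in order."""
    sequences = []
    for start in range(len(tokens)):
        for n in range(2, max_n + 1):
            if start + n > len(tokens):
                break  # longer sequences from this start would run off the end
            sequences.append(tuple(tokens[start:start + n]))
    return sequences


# Hypothetical input: four tokens, phrases of up to three tokens.
example = token_sequences(["a", "b", "c", "d"], max_n=3)
```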

Re: Phrase Frequency For Analysis

2006-06-22 Thread Nader Akhnoukh
Yes, Chris is correct, the goal is to determine the most frequently occurring phrases in a document compared to the frequency of that phrase in the index. So there are only output phrases, no inputs. Also performance is not really an issue, this would take place on an irregular basis and could ru
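
One way to picture the comparison is to rank a document's phrases by how much more often they occur in the document than in the index as a whole. The sketch below assumes phrase counts are already available for both; the ratio-of-relative-frequencies score with add-one smoothing, the function name `distinctive_phrases`, and the sample counts are all assumptions made for illustration, not something proposed in the thread.

```python
from collections import Counter


def distinctive_phrases(doc_counts, index_counts, top_k=3):
    """Rank document phrases by document rate divided by index-wide rate."""
    doc_total = sum(doc_counts.values())
    index_total = sum(index_counts.values())
    scores = {}
    for phrase, count in doc_counts.items():
        doc_rate = count / doc_total
        # Assumption: add-one smoothing so unseen phrases do not divide by zero.
        index_rate = (index_counts.get(phrase, 0) + 1) / (index_total + 1)
        scores[phrase] = doc_rate / index_rate
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Hypothetical counts for one document and for the whole index.
doc_counts = Counter({("new", "york"): 1, ("flux", "capacitor"): 3})
index_counts = Counter({("new", "york"): 500, ("flux", "capacitor"): 3, ("of", "the"): 1000})
ranked = distinctive_phrases(doc_counts, index_counts)
```

With these made-up counts, the phrase that is rare in the index but repeated in the document ranks first.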

Re: Phrase Frequency For Analysis

2006-06-22 Thread Andrzej Bialecki
Chris Hostetter wrote: I think either you misunderstood Nader's question or I did: I believe the goal is to determine what the most frequently occurring phrases are -- not determine how frequently a particular input phrase appears. Isn't the latter a pre-requisite for the former? ;) Regardi

Re: Phrase Frequency For Analysis

2006-06-22 Thread Chris Hostetter
: > I am trying to get the most frequently occurring phrases in a document and : > in the index as a whole. The goal is to compare the two to get something like : > Amazon's SIPs. : Other than indexing the phrases directly, you could use a SpanNearQuery : over the words, use getSpans() on its SpanS

Re: Phrase Frequency For Analysis

2006-06-22 Thread Paul Elschot
On Thursday 22 June 2006 01:33, Nader Akhnoukh wrote: > Hi, I've looked through the archives and it looks like this question has > been asked in one form or another a few times, but without a satisfactory > solution. > > I am trying to get the most frequently occurring phrases in a document and >

Phrase Frequency For Analysis

2006-06-21 Thread Nader Akhnoukh
Hi, I've looked through the archives and it looks like this question has been asked in one form or another a few times, but without a satisfactory solution. I am trying to get the most frequently occurring phrases in a document and in the index as a whole. The goal is to compare the two to get some