Adding to this growing thread, there's really no reason to
index all the term bigrams, trigrams, etc. It's not
only slow, it's also very memory- and disk-intensive. All you need
to do is two passes over the collection.
Pass One
Collect counts of bigrams (or trigrams, or whatever -- if
size is an
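
A minimal sketch of the counting step in the first pass, written in plain Python rather than against any Lucene API. The function names, the whitespace tokenizer, and the sample collection are assumptions for illustration only, and the second pass is not shown.

```python
from collections import Counter
from typing import Iterable, Iterator


def ngrams(tokens: list[str], n: int) -> Iterator[tuple[str, ...]]:
    """Yield every run of n consecutive tokens."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])


def count_ngrams(docs: Iterable[str], n: int = 2) -> Counter:
    """Make one pass over the collection, tallying n-gram occurrences."""
    counts: Counter = Counter()
    for text in docs:
        # Assumption: a naive lowercase/whitespace tokenizer stands in
        # for whatever analyzer the real index uses.
        tokens = text.lower().split()
        counts.update(ngrams(tokens, n))
    return counts


# Hypothetical three-document collection.
collection = [
    "the quick brown fox",
    "the quick red fox",
    "a quick brown dog",
]
bigram_counts = count_ngrams(collection, n=2)
```

Only the counts are kept in memory here; nothing is written to an index, which is the point of counting in a separate pass.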
Nader Akhnoukh wrote:
Yes, Chris is correct, the goal is to determine the most frequently
occurring
phrases in a document compared to the frequency of that phrase in the
index. So there are only output phrases, no inputs.
Also performance is not really an issue, this would take place on an
irre
I may be coming into this thread without knowing enough. I have implemented a
phrase filter, which indexes all token sequences that are 2 to N tokens long.
The n is defined in the constructor.
It takes a stopword Trie for input because the policy I used, based on a published
work I read, was that a
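
A rough sketch of the "all token sequences that are 2 to N tokens long" idea, with N set in the constructor. The class name `PhraseShingler` and its method are hypothetical, and the stopword policy is left out because it is not spelled out above.

```python
class PhraseShingler:
    """Hypothetical helper: emits every token sequence of length 2..max_n."""

    def __init__(self, max_n: int):
        if max_n < 2:
            raise ValueError("max_n must be at least 2")
        self.max_n = max_n

    def phrases(self, tokens):
        # Shorter phrases come first, then longer ones, each left to right.
        for n in range(2, self.max_n + 1):
            for i in range(len(tokens) - n + 1):
                yield " ".join(tokens[i:i + n])


shingler = PhraseShingler(max_n=3)
result = list(shingler.phrases(["new", "york", "city", "hall"]))
```

With four tokens and a maximum length of 3, this yields three two-token phrases followed by two three-token phrases.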
Yes, Chris is correct, the goal is to determine the most frequently occurring
phrases in a document compared to the frequency of that phrase in the
index. So there are only output phrases, no inputs.
Also performance is not really an issue, this would take place on an
irregular basis and could ru
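
One way to picture the document-versus-index comparison, as a sketch only: rank each phrase by the ratio of its relative frequency in the document to its relative frequency in the index. The ratio score, the add-one smoothing, the function name, and the sample counts are all assumptions, not something taken from this thread or from Lucene.

```python
from collections import Counter


def distinctive_phrases(doc_counts: Counter, index_counts: Counter, top: int = 3):
    """Rank phrases by how over-represented they are in the document."""
    doc_total = sum(doc_counts.values())
    index_total = sum(index_counts.values())
    scored = []
    for phrase, count in doc_counts.items():
        doc_rate = count / doc_total
        # Assumption: add-one smoothing, so a phrase absent from the
        # index counts does not cause a division by zero.
        index_rate = (index_counts.get(phrase, 0) + 1) / (index_total + 1)
        scored.append((doc_rate / index_rate, phrase))
    scored.sort(reverse=True)
    return [phrase for _, phrase in scored[:top]]


# Hypothetical phrase counts for one document and for the whole index.
doc_counts = Counter({"information retrieval": 4, "of the": 6})
index_counts = Counter({"information retrieval": 10, "of the": 5000})
ranked = distinctive_phrases(doc_counts, index_counts)
```

A phrase like "of the" is frequent in the document but even more frequent across the index, so it ranks below a phrase that is comparatively rare elsewhere.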
Chris Hostetter wrote:
I think either you misunderstood Nader's question or I did: I believe the
goal is to determine what the most frequently occurring phrases are -- not
to determine how frequently a particular input phrase appears.
Isn't the latter a prerequisite for the former? ;)
Regardi
: > I am trying to get the most frequently occurring phrases in a document and
: > in the index as a whole. The goal is to compare the two to get something like
: > Amazon's SIPs.
: Other than indexing the phrases directly, you could use a SpanNearQuery
: over the words, use getSpans() on its SpanS
On Thursday 22 June 2006 01:33, Nader Akhnoukh wrote:
> Hi, I've looked through the archives and it looks like this question has
> been asked in one form or another a few times, but without a satisfactory
> solution.
>
> I am trying to get the most frequently occurring phrases in a document and
>
Hi, I've looked through the archives and it looks like this question has
been asked in one form or another a few times, but without a satisfactory
solution.
I am trying to get the most frequently occurring phrases in a document and
in the index as a whole. The goal is to compare the two to get some