Re: SIPs and CAPs

2005-07-14 Thread mark harwood
> Do you just do this with terms or do you also
> extract phrases?

The scheme involves these phases:
1) Identify top terms (using algo described)
2) Identify all term "runs" in original text.
3) Identify sensible phrases from large list of term runs
4) Provide shortlist of top scoring terms AND
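Phase 2 might be sketched as follows. The message does not define a "run", so this assumes it means a maximal sequence of consecutive tokens that are all in the top-terms set. The function name `term_runs` and the `min_len` parameter are hypothetical and not from the thread.

```python
from collections import Counter

def term_runs(tokens, top_terms, min_len=2):
    """Count maximal runs of consecutive top terms in a token sequence.

    Assumption: a "run" is an unbroken sequence of tokens that all
    belong to top_terms; runs shorter than min_len are ignored.
    """
    runs = Counter()
    current = []
    # A trailing None sentinel flushes a run that ends at the last token.
    for tok in list(tokens) + [None]:
        if tok is not None and tok in top_terms:
            current.append(tok)
        else:
            if len(current) >= min_len:
                runs[" ".join(current)] += 1
            current = []
    return runs

tokens = "the inverted index stores terms and the inverted index is fast".split()
runs = term_runs(tokens, {"inverted", "index", "terms"})
```

Runs that recur often would then be candidates for the "sensible phrases" of phase 3. How those are chosen is not stated in the excerpt.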

Re: SIPs and CAPs

2005-07-14 Thread Erik Hatcher
On Jul 14, 2005, at 7:17 AM, mark harwood wrote:
> I've done this by comparing term frequency in a subset (in Amazon's case a single book) and looking for a significant "uplift" in term popularity vs that of the general corpus popularity. Practically speaking, in the Amazon case you can treat each

Re: SIPs and CAPs

2005-07-14 Thread mark harwood
I've done this by comparing term frequency in a subset (in Amazon's case a single book) and looking for a significant "uplift" in term popularity vs that of the general corpus popularity. Practically speaking, in the Amazon case you can treat each page in the example book as a Lucene document, crea
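The "uplift" comparison could be sketched roughly as below, in plain Python rather than against a Lucene index. This is an illustration under assumptions, not the poster's actual algorithm: the score is taken to be the ratio of a term's relative frequency in the subset to its relative frequency in the whole corpus. The add-one smoothing, the `min_subset_count` cutoff, and the function name `uplift_scores` are all hypothetical.

```python
from collections import Counter

def uplift_scores(subset_docs, corpus_docs, min_subset_count=2):
    """Rank terms by how much more frequent they are in the subset
    than in the general corpus.

    Each document is a list of tokens. Assumption: uplift is the ratio
    of relative term frequencies, with add-one smoothing on the corpus.
    """
    subset_tf = Counter(t for doc in subset_docs for t in doc)
    corpus_tf = Counter(t for doc in corpus_docs for t in doc)
    subset_total = sum(subset_tf.values())
    corpus_total = sum(corpus_tf.values())

    scores = {}
    for term, count in subset_tf.items():
        if count < min_subset_count:
            continue  # skip terms too rare in the subset to judge
        subset_rate = count / subset_total
        corpus_rate = (corpus_tf[term] + 1) / (corpus_total + 1)
        scores[term] = subset_rate / corpus_rate
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# The "book" is the subset; the corpus contains it plus other documents.
book = [["lucene", "index", "the", "lucene"], ["the", "index", "lucene"]]
corpus = book + [["the", "cat", "the", "dog"], ["the", "book", "index", "the"]]
ranked = uplift_scores(book, corpus)
```

With this toy data a term concentrated in the book ("lucene") ranks above a term that is common everywhere ("the").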

SIPs and CAPs

2005-07-14 Thread Erik Hatcher
Has anyone developed code to extract SIPs (statistically improbable phrases) and CAPs (capitalized phrases) from a Lucene index, such as Amazon does with its books as shown here?