Time to pull out the chalkboard. :-)

SIPs, at least in the Amazon sense, are usually found
by means of statistical independence testing.  You
can find more info in Chris Manning and Hinrich
Schuetze's statistical NLP book.  Heads-up: they're
now working on an IR book with more of a focus on
search quality than index compression; draft chapters
are available at:

 http://www-csli.stanford.edu/~schuetze/information-retrieval-book.html


The simplest independence testing formula is:

Z("t1 t2") = count("t1 t2") - expectedCount("t1 t2")
             ---------------------------------------
             sqrt(corpusSize * prob("t1 t2") * (1-prob("t1 t2"))

expectedCount("t1 t2") = corpusSize * prob("t1 t2")
prob("t1 t2") = P("t1") * P("t2")   [assumes independence]
P("t") = count("t")/corpusSize
corpusSize = total number of term instances in the corpus

This basically gives you the number of standard deviations
by which the actual count exceeds the count you'd expect
if the two terms were independent.  (It assumes a binomial
term distribution, which isn't quite right, but it's close
enough.)

If your corpus is fixed, you can ignore the corpus size
for purposes of ranking.
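
If it helps to see the test in code, here's a minimal
sketch of the scoring function (plain Java, nothing
Lucene-specific; all the names are mine):

public class PhraseScorer {

    // Z-score for the bigram "t1 t2" under the binomial
    // independence model described above.
    public static double zScore(long pairCount, long count1,
                                long count2, long corpusSize) {
        double pPair = ((double) count1 / corpusSize)
                     * ((double) count2 / corpusSize);  // prob("t1 t2")
        double expected = corpusSize * pPair;   // expectedCount("t1 t2")
        return (pairCount - expected)
             / Math.sqrt(corpusSize * pPair * (1.0 - pPair));
    }
}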

So all you need to do is iterate over term pairs, look
at the count under a phrase search, and compare it to the
expected count.  The only real trick is making that
search efficient.  The usual ways to do that (sketched
in the code below) are to

1.  only consider phrases that appear a minimum number of times
2.  build a trie structure to store the phrase counts
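
Here's a sketch of the counting step with a plain HashMap
and a minimum-count cutoff; a real trie just makes the
same counts cheaper to store, since the bigrams share
prefixes:

import java.util.HashMap;
import java.util.Map;

public class BigramCounter {

    // Count all adjacent term pairs, then prune anything
    // below the minimum count before scoring.
    public static Map<String,Integer> countBigrams(String[] tokens,
                                                   int minCount) {
        Map<String,Integer> counts = new HashMap<String,Integer>();
        for (int i = 0; i + 1 < tokens.length; i++) {
            counts.merge(tokens[i] + " " + tokens[i+1], 1, Integer::sum);
        }
        counts.values().removeIf(c -> c < minCount);
        return counts;
    }
}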

Next, you can do something similar if you have two
Lucene document collections, say a background and
foreground collection:

Z("t1 t2") = countFG("t1 t2") - fgCorpusSize * probBG("t1 t2")
             --------------------------------------
             sqrt(fgSize * pBG("t1 t2") * (1 - pBG("t1 t2"))

where countFG is the count in the foreground corpus,
fgCorpusSize is the size of the foreground corpus,
and probBG is the probability in the background
corpus.  Now what's nifty is that probBG("t1 t2") can
either be done with the background phrase count, or by further
factoring it into probBG("t1")*probBG("t2").  The former
gives you a measure of newness and the second a measure
of phrasalness.
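
In code, the two variants differ only in how you
estimate probBG.  Here's a sketch with made-up counts
(pulling the real counts out of Lucene would go through
the term statistics of the two indexes):

public class FgBgScorer {

    // Z-score of the foreground phrase count against the
    // expectation derived from the background model.
    static double zScore(long fgPairCount, double probBG, long fgSize) {
        double expected = fgSize * probBG;
        return (fgPairCount - expected)
             / Math.sqrt(fgSize * probBG * (1.0 - probBG));
    }

    public static void main(String[] args) {
        // made-up counts, just to show the two variants
        long fgPairCount = 40, fgSize = 10000;
        long bgPairCount = 50, bgCount1 = 2000,
             bgCount2 = 3000, bgSize = 1000000;

        // newness: probBG from the background phrase count
        double newness =
            zScore(fgPairCount, (double) bgPairCount / bgSize, fgSize);

        // phrasalness: probBG factored into background unigrams
        double phrasalness =
            zScore(fgPairCount,
                   ((double) bgCount1 / bgSize)
                       * ((double) bgCount2 / bgSize),
                   fgSize);

        System.out.printf("newness=%.2f  phrasalness=%.2f%n",
                          newness, phrasalness);
    }
}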

Note that the foreground corpus can be as small as
a single document or a small cluster of documents.

Matthew Hurst suggested blending the phrase
probability probBG("t1 t2") with the foreground
independence probability probFG("t1") * probFG("t2")
to find things that are both new and phrase-like.
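
He didn't pin down a blending formula; the simplest
thing I can think of is linear interpolation, something
like this (the weight lambda is my own choice, not his):

class BlendedProb {

    // Interpolate the background phrase probability with
    // the foreground independence probability.  lambda
    // near 1 emphasizes newness, near 0 phrasalness.
    static double blend(double probBGPhrase, double probFGIndep,
                        double lambda) {
        return lambda * probBGPhrase
             + (1.0 - lambda) * probFGIndep;
    }
}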

I'm going to be writing this all up in a bit longer
form in a case study for the revised Lucene in Action,
with explanations of how to find the significant
terms relative to a query, like Scirus.com does.

- Bob Carpenter
  Alias-i

PS:  I also wrote a somewhat longer blog entry with more math,
references, and a comparison to what Matthew Hurst at
Nielsen BuzzMetrics is doing:

     http://www.alias-i.com/blog/?p=14

(But our blog server appears to be down, and the sysadmin's
out to lunch...)


Larry Ogrodnek wrote:
One thing that I played with was creating multiple phrase indexes, one
each for 2, 3, 4, and 5 words.  I wrote a tokenizer that would batch up
the words, so, for the input string:

The quick brown fox jumps over the slow lazy dog.

The tokenizer for 3 words would return:

The quick brown
quick brown fox
brown fox jumps
fox jumps over
...
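
The batching itself is just a sliding window over the
token stream, something like this sketch (a real Lucene
version would live in a TokenFilter):

import java.util.ArrayList;
import java.util.List;

public class WordBatcher {

    // Slide a window of n words across the input,
    // emitting each batch as one space-joined token.
    public static List<String> batches(String[] words, int n) {
        List<String> out = new ArrayList<String>();
        for (int i = 0; i + n <= words.length; i++) {
            StringBuilder sb = new StringBuilder(words[i]);
            for (int j = 1; j < n; j++) {
                sb.append(' ').append(words[i + j]);
            }
            out.add(sb.toString());
        }
        return out;
    }
}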

This seemed like a reasonable start... the problem is resolving the
overlap for display, and figuring out which words are the most
important, e.g. if the above sentence itself was pretty rare, and you're
looking at the phrase-index-3, each one of its sub-phrases would end up
being significant.  Which one do you show?  Or do you combine them
into a longer phrase?  If so, where do you stop?
