Now, hell would be the place for me where I would have to prove that Lucene's ranking is exactly equivalent to some transformation of vector space, followed by using the *cosine* for the ranking. It can't really be, as Lucene sometimes returns results > 1.0 and only some ruthless normalisation keeps them within 0.0 to 1.0. In other words, there are still some rough corners in Lucene where a good theorist could find some work.
One problem with computing the theoretically correct normalization (i.e., normalizing tf/idf vectors to the unit sphere, so that norm = sqrt(sum_t(weight(t)^2))) is that it changes each time the index is modified, and must then be recomputed for every document. This is because the weights include IDFs, and when the index changes, the IDFs change.
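To make that concrete, here is a minimal sketch (not Lucene's actual code) of that "correct" norm. The idf formula, the class and method names, and the numbers in main are all just illustrative assumptions; the point is only that the norm depends on docFreq and numDocs, so every stored norm would change whenever the index does.

import java.util.Map;

public class CosineNorm {

  // An illustrative idf; the exact formula is an assumption, not Lucene's.
  static double idf(int numDocs, int docFreq) {
    return 1.0 + Math.log((double) numDocs / (1 + docFreq));
  }

  // Euclidean length of a document's tf*idf vector:
  // norm = sqrt(sum_t (tf(t,d) * idf(t))^2)
  static double norm(Map<String, Integer> termFreqs,
                     Map<String, Integer> docFreqs,
                     int numDocs) {
    double sumOfSquares = 0.0;
    for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
      double weight = e.getValue() * idf(numDocs, docFreqs.get(e.getKey()));
      sumOfSquares += weight * weight;
    }
    return Math.sqrt(sumOfSquares);
  }

  public static void main(String[] args) {
    Map<String, Integer> tf = Map.of("lucene", 3, "ranking", 1);
    // Norm under the current index statistics (hypothetical numbers):
    System.out.println(norm(tf, Map.of("lucene", 50, "ranking", 400), 1000));
    // After adding documents, docFreqs and numDocs change, so the same
    // document's norm changes even though its own terms did not:
    System.out.println(norm(tf, Map.of("lucene", 60, "ranking", 450), 1200));
  }
}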
But research has also shown that this mathematically correct normalization is not the best, e.g. Singhal et al.'s work on pivoted normalization:
http://citeseer.ist.psu.edu/singhal96pivoted.html
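For a rough idea of what pivoted normalization does, here is a tiny sketch in the spirit of that paper. The slope value, the pivot value, and the method names are assumptions chosen for illustration; the idea is that the effective norm is tilted around the collection-average norm ("pivot") rather than being the document's own length.

public class PivotedNorm {

  // oldNorm is the document's plain (cosine) norm; pivot is the average
  // of oldNorm over the collection; slope is a tuning parameter.
  static double pivotedNorm(double oldNorm, double pivot, double slope) {
    return (1.0 - slope) * pivot + slope * oldNorm;
  }

  public static void main(String[] args) {
    double pivot = 120.0;  // assumed average norm for the collection
    double slope = 0.75;   // assumed slope parameter
    // Shorter than the pivot: effective norm 90 > raw 80, score drops a bit.
    System.out.println(pivotedNorm(80.0, pivot, slope));
    // Longer than the pivot: effective norm 255 < raw 300, score rises,
    // correcting the tendency of pure length normalization to over-penalize
    // long documents.
    System.out.println(pivotedNorm(300.0, pivot, slope));
  }
}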
More generally, this shows that information retrieval is not a theoretical field, but rather a heuristic one. (Although someone may flame me for saying that...) The vector space model is just a useful analogy, not a verifiable theory of document meaning. It suggests some formulae that can be improved through inspiration and experimentation.
Doug