Grant Ingersoll wrote:
>
>  What I would like to get at is why anyone thinks scores are
> comparable across queries to begin with.
>
They are somewhat comparable because the score is an approximation of
the cosine between the document and query vectors, plus boosts and so
on. It measures how close the vectors are to each other. If q1 makes a
smaller angle with d1 than q2 does with d2, then you can compare the
two. It's just vector similarity. It's approximate because we fudge the
normalization. Why do you think the scores within a single query are
comparable? What's the difference when you try another query? The query
is the difference, and the query norm is what makes the scores more
comparable. Another query is just a different query vector. It's still
going to be at a given "angle" from the doc vectors, and closer is
considered a better match. We don't apply the query norm to improve
anything, or because someone discovered something. It's just part of
the formula for calculating the cosine, which comes from the dot
product formula. You can drop it and keep the same relative rankings
within a query, but then the score is further from the cosine, because
you start scaling by the magnitude of the query vector. When you do
that, scores are not so comparable across queries.
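
To make that concrete, here is a minimal sketch in plain Python. It is
not Lucene's actual scoring code: the vectors are made-up term weights,
and cosine(), score_without_query_norm() and ranking() are hypothetical
helper names used only for illustration. It shows that dropping the
1/|q| factor keeps the ranking within a query but makes the score grow
with the magnitude of the query vector.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(v):
    return math.sqrt(dot(v, v))

def cosine(q, d):
    # Full formula: dot product divided by both vector magnitudes.
    return dot(q, d) / (norm(q) * norm(d))

def score_without_query_norm(q, d):
    # Same thing with the 1/|q| factor dropped: equals cosine(q, d) * |q|.
    return dot(q, d) / norm(d)

# Made-up term-weight vectors, purely illustrative.
docs = {"d1": [1.0, 2.0, 0.0], "d2": [0.0, 1.0, 3.0]}
q = [1.0, 1.0, 0.0]
q_scaled = [10.0 * x for x in q]  # same direction, 10x the magnitude

def ranking(score, query):
    # Document names ordered from best to worst score for this query.
    return sorted(docs, key=lambda name: score(query, docs[name]), reverse=True)

for name, d in docs.items():
    print(name,
          cosine(q, d), cosine(q_scaled, d),
          score_without_query_norm(q, d), score_without_query_norm(q_scaled, d))
print(ranking(cosine, q), ranking(score_without_query_norm, q))
```

The two cosine columns match for each document, while the unnormalized
score for q_scaled is ten times the one for q, even though both queries
point in the same direction.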

If you take out the queryNorm, the scores are much less comparable. You
are effectively multiplying the cosine by the magnitude of the query
vector, so different queries will scale the score differently, and not
in a helpful way: a term vector and a query vector can have very
different magnitudes but very similar term distributions. That's why we
use the cosine rather than Euclidean distance in the first place. I'm
pretty sure it's more linear algebra than IR, or the vector stuff from
calc 3 (or wherever else different schools put it).
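
A quick illustration of the cosine vs. Euclidean point, again as a
sketch with made-up numbers rather than anything taken from Lucene:
doc_tf and query_tf are hypothetical term-weight vectors with the same
term distribution but different magnitudes.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Same relative term distribution, but the second vector is twice as long.
doc_tf = [1.0, 2.0, 3.0]
query_tf = [2.0, 4.0, 6.0]

print(cosine(doc_tf, query_tf))     # angle-based: ignores the magnitude difference
print(euclidean(doc_tf, query_tf))  # distance-based: penalizes the magnitude difference
```

The cosine treats the two vectors as pointing the same way, while the
Euclidean distance between them is sqrt(14), driven entirely by the
difference in magnitude.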
