Yes, it's a good point. I'm coming at it from a more pure angle. And I'm
not so elegant in my thought patterns :)

Right though - our document vector normalization is - uh - quick and
dirty :) It's about the cheapest one I've seen other than root(length).

I don't think that scores between queries are very comparable in general
in Lucene either - but they would be even less so if we dropped the
query norm. As I've argued in the past - if it had any real perf hit,
I'd be on the side of dropping it - but from what I can see, it really
doesn't, so I don't see why we should further skew the scores.
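
To put rough numbers on that, here is a small Python sketch of the
arithmetic. It assumes a stripped-down version of the default scoring:
tf = sqrt(freq), a document norm of 1/sqrt(numTermsInDocument), idf
squared in the sum, and queryNorm = 1/sqrt(sum of squared idf). Boosts,
coord and norm encoding are left out, and the idf values and function
names are made up for illustration, not taken from Lucene.

```python
import math

def query_norm(query_terms, idf):
    # 1 / sqrt(sum of squared query weights); the weights are just idf here
    return 1.0 / math.sqrt(sum(idf[t] ** 2 for t in query_terms))

def score(query_terms, doc_terms, idf, use_query_norm=True):
    # "Quick and dirty" document normalization: 1 / sqrt(number of terms)
    length_norm = 1.0 / math.sqrt(len(doc_terms))
    raw = sum(math.sqrt(doc_terms.count(t)) * idf[t] ** 2
              for t in query_terms) * length_norm
    return raw * query_norm(query_terms, idf) if use_query_norm else raw

# Hypothetical idf values: TERM1 is common, TERM2 is rare.
idf = {"TERM1": 1.5, "TERM2": 6.0}

# Two perfect single-term matches, as in the example quoted below.
s1 = score(["TERM1"], ["TERM1"], idf)
s2 = score(["TERM2"], ["TERM2"], idf)

# The same two matches with the query norm dropped.
r1 = score(["TERM1"], ["TERM1"], idf, use_query_norm=False)
r2 = score(["TERM2"], ["TERM2"], idf, use_query_norm=False)

print("with queryNorm:   ", s1, s2)
print("without queryNorm:", r1, r2)
```

With the query norm each perfect match scores idf(term), so the two
queries differ by a factor of 4 here. Without it each scores idf(term)^2
and the gap grows to a factor of 16. Neither is comparable across
queries, but dropping the norm makes it worse.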

Jake Mannix wrote:
> Remember: we're not really doing cosine at all here.  The factor of
> IDF^2 on the top, with the factor of 1/sqrt(numTermsInDocument) on the
> bottom couples together to end up with the following effect:
>
>  q1 = "TERM1"
>  q2 = "TERM2"
>
> doc1 = "TERM1"
> doc2 = "TERM2"
>
> score(q1, doc1) = idf(TERM1)
> score(q2, doc2) = idf(TERM2)
>
> Both are perfect matches, but one scores higher (possibly much higher)
> than the other.
>
> Boosts work just fine with cosine (it's just a way of putting "tf" into
> the query side as well as in the document side), but normalizing
> documents without taking the idf of terms in the document into
> consideration blows away the ability to compare scores in default
> Lucene scoring, even *with* the queryNorm() factored in.
>
> I know you probably know this Mark, but it's important to make sure
> we're stating that in Lucene as is currently structured, scores can be
> *wildly* different between queries, even with queryNorm() factored in,
> for the sake of people reading this who haven't worked through the
> scoring in detail.
>
>   -jake
>  
>
> On Fri, Nov 20, 2009 at 2:24 PM, Mark Miller <markrmil...@gmail.com>
> wrote:
>
>     Grant Ingersoll wrote:
>     >
>     >  What I would like to get at is why anyone thinks scores are
>     > comparable across queries to begin with.
>     >
>     They are somewhat comparable because we are using the approximate
>     cosine between the document/query vectors for the score - plus
>     boosts n stuff. How close the vectors are to each other. If q1 has
>     a smaller angle diff with d1 than q2 does with d2, then you can do
>     a comparison. It's just vector similarities. It's approximate
>     because we fudge the normalization. Why do you think the scores
>     within a query search are comparable? What's the difference when
>     you try another query? The query is the difference, and the query
>     norm is what makes it more comparable. It's just a different query
>     vector with another query. It's still going to just be a given
>     "angle" from the doc vectors. Closer is considered a better match.
>     We don't do it to improve anything, or because someone discovered
>     something - it's just part of the formula for calculating the
>     cosine. It's the dot product formula. You can lose it and keep the
>     same relative rankings, but then you are further from the cosine
>     for the score - you start scaling by the magnitude of the query
>     vector. When you do that they are not so comparable.
>
>     If you take out the queryNorm, it's much less comparable. You are
>     effectively multiplying the cosine by the magnitude of the query
>     vector - so different queries will scale the score differently -
>     and not in a helpful way - a term vector and query vector can have
>     very different magnitudes, but very similar term distributions.
>     That's why we are using the cosine rather than Euclidean distance
>     in the first place. Pretty sure it's more linear algebra than IR -
>     or the vector stuff from calc 3 (or wherever else different
>     schools put it).
>
>     ---------------------------------------------------------------------
>     To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
>     <mailto:java-dev-unsubscr...@lucene.apache.org>
>     For additional commands, e-mail: java-dev-h...@lucene.apache.org
>     <mailto:java-dev-h...@lucene.apache.org>
>
>

