Yes, it's a good point. I'm coming at it from a more pure angle. And I'm not so elegant in my thought patterns :)
Right though - our document vector normalization is - uh - quick and dirty :) It's about the cheapest one I've seen other than root(length). I don't think that scores between queries are very comparable in general in Lucene either - but they would be even less so if we dropped the query norm. As I've argued in the past - if it had any real perf hit, I'd be on the side of dropping it - but from what I can see, it really doesn't, so I don't see why we should further skew the scores.

Jake Mannix wrote:
> Remember: we're not really doing cosine at all here. The factor of IDF^2 on
> the top, with the factor of 1/sqrt(numTermsInDocument) on the bottom, couples
> together to end up with the following effect:
>
>     q1 = "TERM1"
>     q2 = "TERM2"
>
>     doc1 = "TERM1"
>     doc2 = "TERM2"
>
>     score(q1, doc1) = idf(TERM1)
>     score(q2, doc2) = idf(TERM2)
>
> Both are perfect matches, but one scores higher (possibly much higher) than
> the other.
>
> Boosts work just fine with cosine (it's just a way of putting "tf" into the
> query side as well as in the document side), but normalizing documents
> without taking the idf of terms in the document into consideration blows
> away the ability to compare scores in default Lucene scoring, even *with*
> the queryNorm() factored in.
>
> I know you probably know this Mark, but it's important to make sure we're
> stating that in Lucene as is currently structured, scores can be *wildly*
> different between queries, even with queryNorm() factored in, for the sake
> of people reading this who haven't worked through the scoring in detail.
>
> -jake
>
> On Fri, Nov 20, 2009 at 2:24 PM, Mark Miller <markrmil...@gmail.com> wrote:
>
>     Grant Ingersoll wrote:
>     > What I would like to get at is why anyone thinks scores are
>     > comparable across queries to begin with.
>
>     They are somewhat comparable because we are using the approximate
>     cosine between the document/query vectors for the score - plus boosts
>     n stuff. How close the vectors are to each other. If q1 has a smaller
>     angle diff with d1 than q2 does with d2, then you can do a comparison.
>     It's just vector similarities. It's approximate because we fudge the
>     normalization. Why do you think the scores within a query search are
>     comparable? What's the difference when you try another query? The
>     query is the difference, and the query norm is what makes it more
>     comparable. It's just a different query vector with another query.
>     It's still going to just be a given "angle" from the doc vectors.
>     Closer is considered a better match. We don't do it to improve
>     anything, or because someone discovered something - it's just part of
>     the formula for calculating the cosine. It's the dot product formula.
>     You can lose it and keep the same relative rankings, but then you are
>     further from the cosine for the score - you start scaling by the
>     magnitude of the query vector. When you do that they are not so
>     comparable.
>
>     If you take out the queryNorm, it's much less comparable. You are
>     effectively multiplying the cosine by the magnitude of the query
>     vector - so different queries will scale the score differently - and
>     not in a helpful way - a term vector and query vector can have very
>     different magnitudes, but very similar term distributions. That's why
>     we are using the cosine rather than euclidean distance in the first
>     place. Pretty sure it's more linear algebra than IR - or the vector
>     stuff from calc 3 (or wherever else different schools put it).
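
For anyone who wants to check the arithmetic in the thread, here is a rough Python sketch of the single-term case from Jake's example, with and without the query norm. It is not Lucene code. It assumes the classic default formula as I understand it (tf = sqrt(freq), idf = 1 + ln(numDocs / (docFreq + 1)), lengthNorm = 1/sqrt(numTermsInDocument), queryNorm = 1/sqrt(sum of squared query weights)), leaves out boosts and the lossy byte encoding of norms, and the function names, index size and document frequencies are all made up for illustration.

```python
import math


def idf(doc_freq, num_docs):
    # Assumed classic default idf: 1 + ln(numDocs / (docFreq + 1)).
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def score(query_terms, doc_tokens, doc_freqs, num_docs, use_query_norm=True):
    """Simplified tf-idf score for a query of distinct terms (boosts omitted)."""
    # Query-side weight of each term is its idf (boost assumed to be 1).
    weights = {t: idf(doc_freqs[t], num_docs) for t in query_terms}

    # queryNorm = 1 / sqrt(sum of squared query weights), or 1 when dropped.
    query_norm = 1.0
    if use_query_norm:
        query_norm = 1.0 / math.sqrt(sum(w * w for w in weights.values()))

    # The cheap document normalization: 1 / sqrt(numTermsInDocument).
    length_norm = 1.0 / math.sqrt(len(doc_tokens))

    matched = [t for t in query_terms if t in doc_tokens]
    coord = len(matched) / len(query_terms)

    # Each matching term contributes tf * idf^2 * lengthNorm.
    total = sum(
        math.sqrt(doc_tokens.count(t)) * weights[t] ** 2 * length_norm
        for t in matched
    )
    return coord * query_norm * total


# Hypothetical index statistics: TERM1 is rare, TERM2 is common.
NUM_DOCS = 1000
DOC_FREQS = {"TERM1": 9, "TERM2": 499}

# Two "perfect" single-term matches, scored with the query norm.
s1 = score(["TERM1"], ["TERM1"], DOC_FREQS, NUM_DOCS)
s2 = score(["TERM2"], ["TERM2"], DOC_FREQS, NUM_DOCS)

# The same two matches with the query norm dropped.
raw1 = score(["TERM1"], ["TERM1"], DOC_FREQS, NUM_DOCS, use_query_norm=False)
raw2 = score(["TERM2"], ["TERM2"], DOC_FREQS, NUM_DOCS, use_query_norm=False)

print("with queryNorm:   ", s1, s2)
print("without queryNorm:", raw1, raw2)
```

Under these assumptions each perfect match scores idf(term) with the query norm and idf(term)^2 without it, so the two scores already differ across queries and the gap only widens when the query norm is dropped. Within a single query the query norm is one constant factor, so relative rankings are unchanged either way.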

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org