On Fri, Nov 20, 2009 at 2:50 PM, Mark Miller <markrmil...@gmail.com> wrote:
> Jake Mannix wrote:
>> Remember: we're not really doing cosine at all here.
>
> This, I think, is fuzzy, right? It seems to be common to still call this
> cosine scoring loosely - pretty much every practical impl fudges things
> somewhat when doing the normalization (though we are on the heavy side of
> fudgers) - I think it's pretty rare to do the true cosine because it's so
> expensive. It can be somewhat misleading though.

Cosine isn't expensive - it's just *impossible* to do when you're doing
incremental indexing: normalizing to actually do cosine similarity requires
knowing the idf at *index time*, so the cosine norm (the norm Lucene does
use on the query side) of a field -

  1/sqrt(sum_i (tf(term_i)^2 * idf(term_i)^2))

- is impossible to calculate. Incremental indexing has its costs!

In cases where you *do* have the exact idf values for your corpus's
term-vocabulary (or a good way to approximate them - since idf only varies
logarithmically, this isn't too hard), then if you override lengthNorm to
always return 1, take your document's Fields, and boost() them by the
factor I listed above, you can get exact (well, as exact as you can with
only one-byte field-norms :\ [are we really that starved for RAM that we
have to have that little precision?]) cosine similarity, at really no
additional *query-time* performance hit. Just a little math done at
indexing time.

> Have you looked at the Similarity scoring explanation page that was
> recently improved? Have any suggestions on changes to it? Doron put a
> fair amount of work into improving it recently, but I think it could
> always get better. It's currently leaning towards presenting this as
> cosine - that seems in line with the few text books I've seen, but I'm
> admittedly not that deep into any of this.
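For what it's worth, the index-time trick above (lengthNorm returning 1,
each Field boosted by the cosine norm) can be sketched in a few lines of
plain Python - this isn't Lucene API, and the idf table here is a made-up
stand-in for your corpus's real values:

```python
import math

def cosine_norm_boost(term_freqs, idf):
    """Index-time field boost equal to the true cosine norm:
    1 / sqrt(sum(tf^2 * idf^2)) over the field's terms."""
    length = math.sqrt(sum((tf * idf[t]) ** 2
                           for t, tf in term_freqs.items()))
    return 1.0 / length

# Hypothetical 10-term document: "web" plus nine equally common terms,
# all with idf ~3.
doc = {"web": 1, **{"common%d" % i: 1 for i in range(9)}}
idf = {t: 3.0 for t in doc}

boost = cosine_norm_boost(doc, idf)
# With lengthNorm() returning 1 and the field boosted by this factor,
# the stored norm is 1/sqrt(10 * 3^2), so a query on "web" scores
# idf(web) * boost = 1/sqrt(10) - the true cosine.
score = 3.0 * boost
```

The only query-time cost is the same norm multiply Lucene already does;
the sum over tf^2 * idf^2 happens once per document at indexing time.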
Yeah, that page is a lot better than it used to be, but I'd really try not
to call this cosine - when people think of cosines, they think of: a)
angles, b) numbers which are less than one, and c) when they're equal to 1,
you have a perfect match. We have a) - kinda. We do not have b) or c), and
we don't have scores which are really comparable across queries, and maybe
Grant's question should be answered with the statement: they shouldn't be.

If I query Google with "web" (getting a little less than three billion
hits, so it maybe has an idf of 3 or 4), and find a document with 10 terms,
one of which is "web" (and the others are equally common in terms of idf),
then Lucene says we should score that as:

  idf(web) / sqrt(10) =~ 0.9

while cosine would say:

  idf(web) / sqrt(1^2 * idf(web)^2 + tf(term2)^2 * idf(term2)^2 + ...)
    =~ 1/sqrt(10) = 0.31

If I instead query with "floccinaucinilipilification" (which comes up with
about 45K hits, so maybe has an idf of 20), and find a document with 10
terms, one of which is floccinaucinilipilification, and the other terms are
also similarly rare (also idf 20), Lucene says the score is

  idf(floccinaucinilipilification) / sqrt(10) =~ 6.5

while cosine would say:

  idf(floccinaucinilipilification) / sqrt(1^2 * idf(flocci...)^2
    + tf(rareTerm2)^2 * idf(rareTerm2)^2 + ...) =~ 1/sqrt(10) = 0.31

What is nice about *real* cosine is that it captures a sense of absolute
scoring - the hit on the really rare word in a document which does not have
a lot of other rare words (and is in general pretty short) is indeed
reflected in the score being close to 1.0 (a perfect match), whereas the
Lucene default scoring does not give such a measure - "web" scores close to
1.0 on its hit, and you only notice that this isn't as good as the
rare-word hit because that one scores way *higher* than 1.0.
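The arithmetic in the two examples above can be checked with a quick
sketch (plain Python; the idf values of 3 and 20 are the rough guesses
from the text, and tf is 1 for every term):

```python
import math

def lucene_default_score(query_idf, num_terms):
    # Lucene's default lengthNorm is 1/sqrt(number of terms in the field),
    # so a single-term query scores idf / sqrt(num_terms).
    return query_idf / math.sqrt(num_terms)

def true_cosine_score(query_idf, doc_term_idfs):
    # True cosine divides by the idf-weighted length of the document
    # vector instead.
    return query_idf / math.sqrt(sum(w ** 2 for w in doc_term_idfs))

# "web": idf ~3, in a 10-term document whose terms all have idf ~3.
web = lucene_default_score(3.0, 10)              # ~0.95
web_cos = true_cosine_score(3.0, [3.0] * 10)     # 1/sqrt(10) ~ 0.32

# "floccinaucinilipilification": idf ~20, 10-term document, all idf ~20.
rare = lucene_default_score(20.0, 10)            # ~6.3
rare_cos = true_cosine_score(20.0, [20.0] * 10)  # 1/sqrt(10) ~ 0.32 again
```

The two cosine scores come out identical (the query idf cancels against
the document length), while the Lucene defaults differ by the ratio of the
idfs - which is exactly the incomparability being described.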
On the other hand, the thing which Lucene scoring does that cosine does
*not* is capture the fact that hitting the rare word is *way* better, from
a user perspective, than hitting a common word - so giving them equal
scores because their cosines are the same is maybe not the right thing to
do (and so "yay Doug!", heh).

> Have you looked at the Similarity scoring explanation page that was
> recently improved? Have any suggestions on changes to it? Doron put a
> fair amount of work into improving it recently, but I think it could
> always get better. It's currently leaning towards presenting this as
> cosine - that seems in line with the few text books I've seen, but I'm
> admittedly not that deep into any of this.

I think it's way better than it used to be (although it's even more dense,
which I guess is required, given that it's a dense subject), but I really
don't think we want to let people think it's a cosine - it's "like" a
cosine, but with a rather different normalization. Doron's got it stated
like that already, but it needs to be crystal clear that scores are
sometimes higher than 1, and in fact the "best" score for one query may be
rather different from the "best" for another query.

  -jake