Jake Mannix wrote:
>
>
> On Fri, Nov 20, 2009 at 4:09 PM, Mark Miller <markrmil...@gmail.com> wrote:
>  
>
>     But cosine has two norms? The query norm and the document norm -
>     taking the two vectors to unit length - it looks expensive to me to
>     do both of them properly. The IR lit I've seen fudges them down to
>     Root(L) even in the non-incremental case for the document vector
>     normalization.
>
>
> Lucene has two norms too: lengthNorm (of the field), produced at index
> time, and queryNorm.  The cost of queryNorm is paid once per query, not
> once per document, and so, as you yourself say, is not a performance
> problem.  The lengthNorm is factored into the document at index time,
> and so is not a hit at *query time* at all.
Right - Lucene's two norms come from the cosine similarity. The
queryNorm is our version of the Euclidean length of the query vector.
The lengthNorm is our version of the Euclidean length of the document
vector. I don't think tf/idf systems that use cosine similarity - even
less-fudged cosine similarity - generally produce scores that are
actually the cosine. The cosine similarity is just a factor that helps
remove document length from the score, not something that produces the
score by itself. And there are a few different options for how the
document length normalization is done. I've never seen it done exactly
in the lit though - it's always an approximation.
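(For reference, the textbook cosine is sim(q,d) = (q . d) / (|q| * |d|),
with |q| and |d| the Euclidean lengths - it's that |d| in the denominator
that the common length normalizations approximate.)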

I know lengthNorm is not a hit at query time - but it can be a
performance issue at index time. And it can be even worse with
incremental indexing, as you said - but is it really impossible?
>  
>
>     > (the norm Lucene does
>     > use on the query side) of fields - 1/sqrt(sum_{i}(tf_(term_i)^2 *
>     > idf(term_i)^2)) -
>     > is impossible to calculate.
>     But isn't that *not* the query side norm? Looks like the document
>     vector
>     norm to me.
>
>
> This is exactly the query norm - 1/sqrt(sumOfSquaredWeights), and
> sum of squared weights is just sum of tf^2 * idf^2, except queries have
> "boost" in place of "tf".  The document vector norm that *Lucene* uses is
> independent of idf of the terms in the document (it's 1/sqrt(doc_length)
> for DefaultSimilarity).
Right - but that's not the real doc norm. That's a fake. You won't get
the real cosine with it. You'd have to use the real document norm. I
wasn't talking in terms of Lucene, just cosine similarity in general.
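That is, the real doc norm would have the same form as the query norm -
1/sqrt(sum_{i}(tf_(term_i)^2 * idf(term_i)^2)) over the document's terms -
where DefaultSimilarity just uses 1/sqrt(doc_length).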
>  
>
>     Where are you doing the correct document vector normalization to
>     get the
>     true cosine? I'm not following. It looks like the above formula is the
>     doc vector norm - so what about the query vector norm then?
>
>
> Ok, the technique I use to do this (I should really just refactor it to a
> point where I can post it for contrib/queries so that other people can use
> this if it's not that commonly known) is to, at index time, calculate the
> true document vector normalizations for the fields, then feed them into
> their boosts, and make sure that Similarity does *not* do field
> normalization itself.
>
> The formula for query vector normalization and document vector
> normalization is *exactly* the same if you translate boosts of terms into
> termFrequency of said term in the query "document".  It's just applied to
> a different bag of words.
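Just so I'm following, something roughly like this? (my guess at the shape
of it, not your actual code - turn off Lucene's own field normalization and
carry the true norm in the field boost at index time; sumOfSquaredTfIdf,
text, doc, and writer are placeholders):

  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.search.DefaultSimilarity;

  // A Similarity that skips Lucene's built-in 1/sqrt(numTerms) lengthNorm,
  // so the only index-time norm left is whatever boost we set on the field.
  public class NoLengthNormSimilarity extends DefaultSimilarity {
    @Override
    public float lengthNorm(String fieldName, int numTerms) {
      return 1.0f;
    }
  }

  // ... then somewhere in the indexing code (sumOfSquaredTfIdf computed
  // externally, e.g. from a prior pass over the corpus for the idfs):
  float trueNorm = (float) (1.0 / Math.sqrt(sumOfSquaredTfIdf));
  Field body = new Field("body", text, Field.Store.NO, Field.Index.ANALYZED);
  body.setBoost(trueNorm);          // carry the true Euclidean doc norm here
  doc.add(body);
  writer.setSimilarity(new NoLengthNormSimilarity());  // no double-normalizing

Though with the norm squashed into a single byte in the index, I'd guess a
fair bit of that "true" norm's precision gets lost anyway?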
>  
>
>     >   a) angles,
>     >   b) numbers which are less than one, and
>     >   c) when they're equal to 1, you have a perfect match.
>
>  
>
>     When they think of cosine, yes - but IR lit seems to fudge this -
>     none of the common document length normalizations are real cosine -
>     they are just square root variations, with Lucene's being the
>     simplest in the family.
>
>
> This is true, which is why I'm not really giving a better wording than
> what Doron already wrote - I doubt I could do better.  I'm concerned
> myself because I've seen lots of questions over the years on the lists
> from people asking how to compare scores across queries, no matter how
> much we tell them not to.  And now here again we're telling them "they're
> not comparable, but they're more comparable than if we didn't have this
> queryNorm in there!"  What exactly does that mean?  I do search, and I'm
> not really sure what it means - what is someone who is just learning this
> stuff going to think?
Who knows? Most of us working on this don't know - good luck to those
just learning :)
>  
>
>
>     >
>     > I think it's way better than it used to be (although it's even more
>     > dense, which
>     > I guess is required, given that it's a dense subject), but I really
>     > don't think
>     > we want to let people think it's a cosine - it's "like" a
>     cosine, but
>     > with a
>     > rather different normalization.
>
>  
>
>     We can amp that up - but like I said, I think that's standard
>     practice - in the IR books I've seen, they say "here is the cosine
>     formula for term vector scoring", and then go on to fudge it in a
>     manner very similar to how Lucene does.
>
>
> Well just because it's standard practice, doesn't mean it's a good idea.
> I came to search from a math background, not CS, so when I see "vector
> space" and "cosine", I maybe think different things than the average coder
> learning IR.  So maybe I'm not the right person to ask how such things
> should be explained!
As good as anyone else here :) I'm taking my thinking from the IR books
I'm looking at, though. They don't appear to consider cosine similarity
to require that the score actually be the cosine - it's just a factor in
the score. And they fudge that factor with shortcuts.
>  
>
>     > Doron's got it stated like that, already, but
>     > it needs to be crystal clear that scores are sometimes higher
>     than 1,
>     > and in
>     > fact the "best" score for one query may be rather different than the
>     > "best"
>     > for another query.
>
>  
>
>     I think this is general knowledge with Lucene, but that may be a
>     stretch - perhaps it's not as widely known by new users as I now
>     assume. Hits used to normalize to a 0-1 score actually, but it's no
>     longer around.
>
>
> It was definitely a good idea to get rid of normalizing by the top score,
> but what could replace it is normalizing by the "best possible score".
> This tells you how close to a perfect match you could have gotten.  For
> cosine, this would just be the cosine of the angle from the document
> vector to the query vector; for Lucene, it would be something similar,
> but would have absolute meaning (maybe... maybe not, because the
> theoretical best score is likely to be far from the actual best score in
> a given corpus).
>
>   -jake
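If I'm reading that right, you mean something like dividing each score by
the score the query would get against itself - score(q,d) / score(q,q) -
which is 1.0 by definition in the pure cosine case. That's just my guess at
how you'd formalize "best possible score" though, and as you say, the
theoretical best could sit a long way from the actual best in a given corpus.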

