Back to Grant's original question, for a second...

On Fri, Nov 20, 2009 at 1:59 PM, Grant Ingersoll <gsing...@apache.org> wrote:


> This makes sense from a mathematical standpoint, assuming scores are comparable.
>  What I would like to get at is why anyone thinks scores are comparable
> across queries to begin with.  I agree it is beneficial in some cases (as
> you described) if they are.   Probably a question suited for an academic IR
> list...
>

Well, without getting into the academic IR side, which I'm not really
qualified to argue about: what is wrong with comparing two queries by saying
that a document which "perfectly" matches a query should score 1.0, and
scaling everything else with respect to that?

Maybe it's better to turn the question around: can you give examples of two
queries where you can see that it *doesn't* make sense to compare scores?
Let's imagine we're doing pure, properly normalized tf-idf cosine scoring
(not default Lucene scoring) on a couple of different fields at once.  Then
whenever a sub-query is exactly equal to the field it's hitting (or the
field is just that query repeated some number of times), the score for that
sub-query will be 1.0.  When the match isn't perfect, the score goes down,
ok.  Sub-queries hitting longer fields (ones that aren't pathologically made
up of repetitions of a small set of terms) will in general find that even
their best scores are very low compared to the best scores on the short
fields (this is true for Lucene as well, of course), but this makes sense:
if you query with a very small set of terms (as is usually done, unless
you're doing a MoreLikeThis kind of query) and you find a match in the
"title" field which is exactly what you were looking for, that field match
is far and away better than anything else you could get in a body match.
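
To make that concrete, here's a toy sketch of what I mean by properly
normalized cosine scoring - plain Java, not Lucene's actual Similarity, and
the terms/weights below are made up - showing the exact-match field landing
at 1.0 while a longer field dilutes the score downward:

    import java.util.HashMap;
    import java.util.Map;

    // Toy normalized tf-idf cosine scoring (NOT Lucene's default
    // Similarity): both vectors are length-normalized, so a field that
    // exactly equals the query scores 1.0.
    public class CosineToy {

      // Cosine of two sparse weight vectors: dot(q, d) / (|q| * |d|).
      static double cosine(Map<String, Double> q, Map<String, Double> d) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : q.entrySet()) {
          Double w = d.get(e.getKey());
          if (w != null) dot += e.getValue() * w;
        }
        return dot / (norm(q) * norm(d));
      }

      static double norm(Map<String, Double> v) {
        double sum = 0.0;
        for (double w : v.values()) sum += w * w;
        return Math.sqrt(sum);
      }

      public static void main(String[] args) {
        // Hypothetical weights, just for shape; real ones would be tf*idf.
        Map<String, Double> query = new HashMap<String, Double>();
        query.put("lucene", 1.2);
        query.put("scoring", 0.8);

        // A title field that is exactly the query.
        Map<String, Double> titleField = new HashMap<String, Double>(query);

        // A longer body field: extra terms dilute the cosine.
        Map<String, Double> bodyField = new HashMap<String, Double>(query);
        bodyField.put("internals", 0.5);
        bodyField.put("explained", 0.5);

        System.out.println(cosine(query, titleField)); // ~1.0
        System.out.println(cosine(query, bodyField));  // ~0.90, strictly less
      }
    }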

To put it more simply - if you really do have cosine similarity (or
Jaccard/Tanimoto or something like that, if you don't care about idf for
some reason), then query scores are normalized relative to "how close did
the documents I found come to *perfectly* matching my query" - 1.0 means you
found your query in the corpus, and anything less means some fractional
proximity.  This is an absolute measure, across queries.
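
The Jaccard/Tanimoto version makes the absolute scale especially obvious,
since it's just set arithmetic over the terms.  A quick sketch (made-up
terms again, no tf or idf at all):

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    public class JaccardToy {
      // Jaccard/Tanimoto over term sets: |A intersect B| / |A union B|.
      // Scores 1.0 exactly when the document's term set equals the
      // query's, so every query lives on the same absolute 0..1 scale.
      static double jaccard(Set<String> q, Set<String> d) {
        Set<String> inter = new HashSet<String>(q);
        inter.retainAll(d);
        Set<String> union = new HashSet<String>(q);
        union.addAll(d);
        return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
      }

      public static void main(String[] args) {
        Set<String> q = new HashSet<String>(Arrays.asList("lucene", "scoring"));
        Set<String> d = new HashSet<String>(
            Arrays.asList("lucene", "scoring", "internals"));
        System.out.println(jaccard(q, q)); // 1.0 - found the query itself
        System.out.println(jaccard(q, d)); // 0.666... - fractional proximity
      }
    }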

Of course, then you ask: well, in reality, in web and enterprise search,
documents are big, queries are small, and you never really find documents
which are perfect matches.  So if the best match for q1, out of your whole
corpus, is 0.1 for doc1, and the best match for q2 is 0.25 for doc2, is it
really true that the best match for the second query is "better" than the
best match for the first?  I've typically tried to remain agnostic on that
front, and instead ask the related question: if the user (or really, a
sampling of many users) queried for (q1 OR q2), and assuming for simplicity
that q1 didn't match any of the good hits for q2 and vice-versa, does the
user (i.e. your gold-standard training set) say that the best result is
doc1, or doc2?  If it's doc1, then you'd better have found a way to boost
q1's score contribution higher than q2's, right?  Is this wrong, in the
theoretical sense?
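
For what it's worth, in Lucene terms I'm picturing something like the
following - a sketch against the current query API, where the fields, terms,
and the 2.5 boost are placeholders you'd tune from your training set:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    public class BoostToy {
      public static void main(String[] args) {
        // Hypothetical sub-queries; field names and terms are made up.
        Query q1 = new TermQuery(new Term("title", "foo"));
        Query q2 = new TermQuery(new Term("body", "bar"));

        // If the gold standard says q1's best hit should beat q2's best
        // hit, push q1's score contribution up (boost value to be tuned).
        q1.setBoost(2.5f);

        BooleanQuery combined = new BooleanQuery();
        combined.add(q1, BooleanClause.Occur.SHOULD);
        combined.add(q2, BooleanClause.Occur.SHOULD);

        System.out.println(combined); // title:foo^2.5 body:bar
      }
    }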

  -jake
