RE: idf and explain(), was Re: Search and Scoring

Chuck Williams Mon, 18 Oct 2004 19:09:45 -0700

Doug Cutting wrote:
  > If this is a big issue for you, as it seems it is, please submit a
patch
  > to optionally disable score normalization in Hits.java.
and:
  > The quantity 'sum(t) weight(t,d)^2' must be recomputed for each
document
  > each time a document is added to the collection, since 'weight(t,d)'
is
  > dependent on global term statistics.  This is prohibitivly
expensive.
  > Research has also demonstrated that such cosine normalization gives
  > somewhat inferior results (e.g., Singhal's pivoted length
normalization).

I'm willing to write, test and contribute code to address the
normalization issue, i.e. to yield scores in [0, 1] that are meaningful
across searches.  Unfortunately, this is considerably more involved that
just optionally eliminating the current normalization in Hits.  Before
undertaking this, I'd like to see if there is agreement in principle
that this is a good idea, and that my specific proposal below is the
right way to go.  Also, I'd like to make sure I've correctly inferred
the constraints on writing code to be incorporated into Lucene.

After looking at this in more detail I agree that the cosine
normalization is not the way to go, because of both efficiency and
effectiveness considerations.  A conceptual approach that would be
efficient, relatively easy to implement, and seems to have at least
reasonable behavior would be to define the top scoring match to have a
score roughly equal to the percentage of query terms it matches (its
"netCoord").  Scores below the top hit would be reduced based on the
ratio of their raw scores to the raw score of the top hit (considering
all of the current score factors, except that I'd like to remove one of
the idf factors, as discussed earlier).

For a couple simple cases:
  a) the top match for a single term query would always have a score of
1.0, 
  b) the top scoring match for a BooleanQuery using DefaultSimilarity
with all non-prohibited TermQuery clauses would have a score of m/n,
where the hit matches m of the n terms.

This isn't optimal, but seems much better than the current situation.
Consider two single-term queries, s and t.  If s matches more strongly
than t in its top hit (e.g., a higher tf in a shorter field), it would
be best if the top score of s was greater score than top score of t.
But this kind of normalization would require keeping additional
statistics that so far as I know are not currently in the index, like
the maximum tf for terms and the minimum length for fields.  These could
be expensive to update on deletes.  Also, normalizing by such factors
could yield lower than subjectively reasonable scores in most cases, so
it's not clear it would be better.

The semantics above are at least clean, easy to understand, and support
what seems to me is the most important motivation to do this:  allowing
an application to use simple thresholding to segregate
likely-to-be-relevant hits from likely-to-be-irrelevant hits.

More specifically, for a BooleanQuery of TermQuery's the resulting score
functions would be:

BooleanQuery of TermQuery's sbq = (tq1 ... tqn)

baseScore(sbq, doc) =
  sum(tqi) boost(tqi)*idf(tqi.term)*tf(tqi.term, doc)*
           lengthNorm(tqi.term.field, doc)

rawScore(sbq, doc) = coord(sbq, doc) * baseScore

norm(sbq, hits) = 1 / max(hit in hits) baseScore(sbq, hit)

score(sbq, doc) = rawScore * norm

rawScore's can be computed in the Scorer.score() methods and therefore
used to sort the hits (e.g., in the instance method for collect() in the
HitCollector in IndexSearcher.search()).  If the top scoring hit does
not have the highest baseScore, then its score could be less that its
coord; this seems desirable.  These formulas imply that no result score
can be larger than its coord, so if coord is well-defined (always
between 0 and 1) then all results will be normalized between 0 and 1.

In general, the netCoord, which takes the place of coord in the simple
case above, needs to be defined for any query.  Conceptually, this
should be the total percentage of query terms matched by the document.
It must be recursively computable from the subquery structure and
overridable for specific Query types (e.g., to support specialized
coords, like one that is always 1.0 as is useful in multi-field
searching).  Suitable default definitions for TermQuery and BooleanQuery
are:

TermQuery.netCoord = 1.0 if term matches, 0.0 otherwise

BooleanQuery(c1 ... cn).netCoord = sum(ci) coord(1, n) * ci.netCoord

This is not quite percentage of terms matched; e.g., consider a
BooleanQuery with two clauses, one of which is a BooleanQuery of 3 terms
and the other which is just a term.  However, it doesn't seem to be
unreasonable for a BooleanQuery to state that its clauses are equally
important, and this is consistent with the current coord behavior.
BooleanQuery.netCoord could be overridden for special cases, like the
pure disjunction I use in my app for field expansions:
MaxDisjunctionQuery(c1 .. cn).netCoord = max(ci) ci.netCoord

Implementing this would proceed along these lines:
  1.  For backwards compatibility, add some kind of newScoring boolean
setting.
  2.  Update all of these places to behave as indicated if newScoring:
    a.  Change Query.weight() to not do any normalization (no call to
sumOfSquaredWeights(), Similarity.queryNorm() or normalize()).
    b.  Update all Query.weight classes to set their value according to
the terms in the score formula above that don't involve the document
(e.g., boost*idf for TermQuery).
    c.  Add suitable netCoord definitions to all Scorer classes.
    d.  Update all Scorer.score() methods to compute the rawScore as
above.
    e.  Add the maximum baseScore as a field kept on TopDocs, computed
in the HitCollector's.
    f.  Change the normalization in Hits to always divide every raw
score by the maximum baseScore.
    g.  Update all of the current explain() methods to be consistent
with this scoring, and to either report the rawScore, or to report the
final score if the normalization factor is provided.
    h.  Add Hits.explain() (or better extend Searcher so that it keeps
the Hits for use in Searcher.explain()) to call the new explain
variation with the normalization factor so that final scores are fully
explained.

If this seems like a good idea, please let me know.  I'm sure there are
details I've missed that would come out during coding and testing.
Also, the value of this is dependent on how reasonable the final scores
look, which is hard to tell for sure until it is working.

The coding standards for Lucene seem reasonably clear from the source
code I've read.  I could use just simple Java so there shouldn't be any
significant JVM dependencies.  The above should be fully backward
compatible due to the newScoring flag.

On another note, I had to remove the German analyzer in my current 1.4.2
source configuration because GermanStemmer failed to compile due to what
are apparently Unicode character constants that I've now got as illegal
two-character character constants.  This is presumably an encoding
problem somewhere that I could track down.  It's not important, but if
the answer is obvious to any of you, I'd appreciate the quick tip.

Thanks,

Chuck

  > -----Original Message-----
  > From: Doug Cutting [mailto:[EMAIL PROTECTED]
  > Sent: Monday, October 18, 2004 9:44 AM
  > To: Lucene Developers List
  > Subject: Re: idf and explain(), was Re: Search and Scoring
  > 
  > Chuck Williams wrote:
  > > That's a good point on how the standard vector space inner product
  > > similarity measure does imply that the idf is squared relative to
the
  > > document tf.  Even having been aware of this formula for a long
time,
  > > this particular implication never occurred to me.  Do you know if
  > > anybody has done precision/recall or other relevancy empirical
  > > measurements comparing this vs. a model that does not square idf?
  > 
  > No, not that I know of.
  > 
  > > Regarding normalization, the normalization in Hits does not have
very
  > > nice properties.  Due to the > 1.0 threshold check, it loses
  > > information, and it arbitrarily defines the highest scoring result
in
  > > any list that generates scores above 1.0 as a perfect match.  It
would
  > > be nice if score values were meaningful independent of searches,
e.g.,
  > > if "0.6" meant the same quality of retrieval independent of what
  > search
  > > was done.  This would allow, for example, sites to use a a simple
  > > quality threshold to only show results that were "good enough".
At my
  > > last company (I was President and head of engineering for
InQuira), we
  > > found this to be important to many customers.
  > 
  > If this is a big issue for you, as it seems it is, please submit a
patch
  > to optionally disable score normalization in Hits.java.
  > 
  > > The standard vector space similarity measure includes
normalization by
  > > the product of the norms of the vectors, i.e.:
  > >
  > >   score(d,q) =  sum over t of ( weight(t,q) * weight(t,d) ) /
  > >                 sqrt [ (sum(t) weight(t,q)^2) * (sum(t)
  > weight(t,d)^2) ]
  > >
  > > This makes the score a cosine, which since the values are all
positive,
  > > forces it to be in [0, 1].  The sumOfSquares() normalization in
Lucene
  > > does not fully implement this.  Is there a specific reason for
that?
  > 
  > The quantity 'sum(t) weight(t,d)^2' must be recomputed for each
document
  > each time a document is added to the collection, since 'weight(t,d)'
is
  > dependent on global term statistics.  This is prohibitivly
expensive.
  > Research has also demonstrated that such cosine normalization gives
  > somewhat inferior results (e.g., Singhal's pivoted length
normalization).
  > 
  > > Re. explain(), I don't see a downside to extending it show the
final
  > > normalization in Hits.  It could still show the raw score just
prior
  > to
  > > that normalization.
  > 
  > In order to normalize scores to 1.0 one must know the maximum score.
  > Explain only computes the score for a single document, and the
maximum
  > score is not known.
  > 
  >  > Although I think it would be best to have a
  >  > normalization that would render scores comparable across
searches.
  > 
  > Then please submit a patch.  Lucene doesn't change on its own.
  > 
  > Doug
  > 
  > 
  > 
  >
---------------------------------------------------------------------
  > To unsubscribe, e-mail: [EMAIL PROTECTED]
  > For additional commands, e-mail: [EMAIL PROTECTED]

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

RE: idf and explain(), was Re: Search and Scoring

Reply via email to