Hey everyone,

I have a question about Lucene/Solr scoring in general. There are many
factors at play in the final score for each document, and very often one
factor will completely dominate everything else when that may not be the
intention.

** The question: is there a way to enforce strict requirements that certain
factors take priority over other factors, and/or that certain factors can
never overtake other factors? Perhaps a set of rules where one factor is
considered before another factor is even examined? Tweaking boost values
and hoping for the best seems imprecise and very fragile. **
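
To illustrate the kind of rule set I have in mind, here is a rough Python
sketch. It is not Lucene/Solr code; the factor names and numbers are made
up purely for illustration. Factors are compared in a fixed priority order,
and a lower-priority factor is only consulted to break ties:

```python
from typing import Dict, List, NamedTuple


class Factors(NamedTuple):
    # Hypothetical per-document factors, highest priority first.
    fields_matched: int   # how many fields contain the query term
    unique_terms: int     # how many distinct query terms matched
    length_norm: float    # only ever used as a tie-breaker


def rank(docs: Dict[str, Factors]) -> List[str]:
    # Tuples compare element by element, so a later factor is only
    # examined when all earlier factors are equal.
    return sorted(docs, key=lambda name: docs[name], reverse=True)


docs = {
    "title_only": Factors(fields_matched=1, unique_terms=1, length_norm=0.50),
    "title_and_body": Factors(fields_matched=2, unique_terms=1, length_norm=0.35),
}
ranking = rank(docs)
```

Here "title_and_body" ranks first even though its length_norm is lower,
because fields_matched is decided before length_norm is ever looked at.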

To make this more concrete, an example:

We previously added the scores of multi-field matches together via an OR,
so:

    score(query "apple") = score(field1:apple) + score(field2:apple)

I changed that to be more in line with DisMaxParser, namely a max:

    score(query "apple") = max(score(field1:apple), score(field2:apple))

I also modified coord so that it only considers actual unique terms
("apple" vs. "orange"), rather than terms across multiple fields
(field1:apple vs. field2:apple).
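
In rough Python terms (a toy model, not our actual Lucene code; the function
names and per-field scores are invented for illustration), the two ways of
combining field scores and the modified coord look something like this:

```python
def combine_sum(field_scores):
    # Old behaviour: every matching field adds to the total.
    return sum(field_scores.values())


def combine_max(field_scores):
    # New behaviour: only the best-matching field counts.
    return max(field_scores.values())


def coord_unique_terms(matched_terms, query_terms):
    # Fraction of distinct query terms that matched in any field;
    # the same term matching in two fields is counted once.
    query = set(query_terms)
    return len(set(matched_terms) & query) / len(query)


# Made-up per-field scores for the query "apple".
scores = {"field1": 0.75, "field2": 0.5}
old_score = combine_sum(scores)
new_score = combine_max(scores)

# "apple" matched in both fields, "orange" matched nowhere.
coord = coord_unique_terms(["apple", "apple"], ["apple", "orange"])
```

With these numbers the sum gives 1.25 and the max gives 0.75, and coord is
0.5 because only one of the two unique query terms matched.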

This seemed like a good idea, but it actually introduced a bug that was
previously hidden. Suddenly, documents matching "apple" in the title and
*nothing* in the body were being boosted over documents matching "apple" in
the title and "apple" in the body! I investigated, and it was due to
lengthNorm: previously, documents matching "apple" in both title and body
were getting very high scores and completely overwhelming lengthNorm. Now
that they were no longer getting *such* high scores, which was beneficial in
many respects, they were also no longer overwhelming lengthNorm. This
allowed lengthNorm to dominate everything else.
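
Here is a toy reproduction of the effect in Python. It assumes the classic
lengthNorm of 1/sqrt(number of terms in the field) and deliberately ignores
tf, idf, boosts and coord; the field lengths are invented, so treat it as a
sketch of the mechanism rather than our real numbers:

```python
import math


def length_norm(num_terms):
    # Assumed classic lengthNorm: shorter fields get a larger norm.
    return 1.0 / math.sqrt(num_terms)


def field_score(num_terms, matches):
    # Toy field score: just the lengthNorm if the term matches.
    return length_norm(num_terms) if matches else 0.0


def score(doc, combine):
    # combine is sum (old behaviour) or max (new behaviour).
    return combine(field_score(n, m) for n, m in doc.values())


# field -> (length in terms, does "apple" match?)
doc_a = {"title": (4, True), "body": (400, False)}  # title only, short title
doc_b = {"title": (9, True), "body": (16, True)}    # title and body

sum_a, sum_b = score(doc_a, sum), score(doc_b, sum)
max_a, max_b = score(doc_a, max), score(doc_b, max)
```

With the sum, doc_b wins because its body match more than makes up for its
longer title. With the max, the body match contributes nothing, so the only
difference left is the title lengthNorm and doc_a comes out on top.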

I'd love to hear your thoughts :)

Tavi
