On Dec 12, 2006, at 2:23 AM, Karl Koch wrote:
However, what exactly is the advantage of using square root instead of log?

Speaking anecdotally, I wouldn't say there's an advantage. There's a predictable effect: very long documents are rewarded, since the damping factor is not as strong. For most of the engines I've built, that hasn't been desirable.
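To make that concrete, here's a small standalone sketch (no Lucene dependency) comparing how strongly sqrt and log damping separate a short document from a long one. The log variant, 1 / (1 + ln n), is illustrative only, not a Lucene default:

```java
public class DampingCompare {
    // Lucene-style damping: norm falls off with the square root of length.
    static double sqrtNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    // Illustrative log damping: falls off much more slowly.
    static double logNorm(int numTerms) {
        return 1.0 / (1.0 + Math.log(numTerms));
    }

    public static void main(String[] args) {
        // Under sqrt, a 100-term doc's norm is about 10x that of a
        // 10,000-term doc; under log, only about 1.8x -- so log damping
        // effectively rewards very long documents.
        System.out.println(sqrtNorm(100) / sqrtNorm(10000)); // ~10
        System.out.println(logNorm(100) / logNorm(10000));   // ~1.8
    }
}
```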

In order to get optimal results for large collections, it's often necessary to customize this by overriding lengthNorm. IME, for searching general content such as random HTML documents, the body field needs a higher damping factor, but more importantly a plateau at the top end to prevent very short documents from dominating the results.

  // Override of Similarity.lengthNorm: floor numTerms at 100 so very
  // short documents can't get an outsized norm (the plateau at the top end).
  public float lengthNorm(String fieldName, int numTerms) {
    numTerms = numTerms < 100 ? 100 : numTerms;
    return (float)(1.0 / Math.sqrt(numTerms));
  }

In contrast, you don't want the plateau for title fields, assuming that malicious keyword stuffing isn't an issue.
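A per-field policy along those lines can be sketched as plain Java (the field names and the 100-term floor are illustrative, and in a real deployment this logic would live in a Similarity subclass):

```java
public class FieldNorms {
    // Plateau for "body", plain 1/sqrt for everything else (e.g. "title").
    static float lengthNorm(String fieldName, int numTerms) {
        if ("body".equals(fieldName)) {
            // Treat any body shorter than 100 terms as if it had 100,
            // so very short documents can't dominate the results.
            numTerms = Math.max(numTerms, 100);
        }
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    public static void main(String[] args) {
        System.out.println(lengthNorm("body", 10));  // ~0.1, plateaued
        System.out.println(lengthNorm("title", 10)); // ~0.316, no plateau
    }
}
```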

This stuff is corpus specific, though.

http://www.mail-archive.com/java-user@lucene.apache.org/msg08496.html

Is there any scientific reason behind this? Does anybody know a paper about this issue?

Here's one from 1997:

Lee, Chuang, and Seamons: "Document Ranking and the Vector Space Model"
http://www.cs.ust.hk/faculty/dlee/Papers/ir/ieee-sw-rank.pdf

Is there perhaps another discussion thread in here which I have not seen?

http://www.mail-archive.com/java-dev@lucene.apache.org/msg04509.html
http://www.mail-archive.com/java-dev@lucene.apache.org/msg01704.html

Searching the mail archives for "lengthNorm" will turn up some more.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
