On Dec 12, 2006, at 2:23 AM, Karl Koch wrote:
However, what exactly is the advantage of using square root instead of log?

Speaking anecdotally, I wouldn't say there's an advantage. There's a predictable effect: very long documents are rewarded, since the damping factor is not as strong. For most of the engines I've built, that hasn't been desirable.
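To make that concrete, here's a small standalone sketch (no Lucene dependency) comparing how strongly sqrt and log damping separate a short document from a long one. The log variant, 1 / (1 + ln n), is illustrative only, not a Lucene default:

```java
public class DampingCompare {
    // Lucene-style damping: norm falls off with the square root of length.
    static double sqrtNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    // Illustrative log damping: falls off much more slowly.
    static double logNorm(int numTerms) {
        return 1.0 / (1.0 + Math.log(numTerms));
    }

    public static void main(String[] args) {
        // Under sqrt, a 100-term doc's norm is about 10x that of a
        // 10,000-term doc; under log, only about 1.8x -- so log damping
        // effectively rewards very long documents.
        System.out.println(sqrtNorm(100) / sqrtNorm(10000)); // ~10
        System.out.println(logNorm(100) / logNorm(10000));   // ~1.8
    }
}
```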

In order to get optimal results for large collections, it's often necessary to customize this by overriding lengthNorm. IME, for searching general content such as random HTML documents, the body field needs a higher damping factor, but more importantly a plateau at the top end to prevent very short documents from dominating the results.

  // Override of Similarity.lengthNorm: floor numTerms at 100 so very
  // short documents can't get an outsized norm (the plateau at the top end).
  public float lengthNorm(String fieldName, int numTerms) {
    numTerms = numTerms < 100 ? 100 : numTerms;
    return (float)(1.0 / Math.sqrt(numTerms));
  }

In contrast, you don't want the plateau for title fields, assuming that malicious keyword stuffing isn't an issue.
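A per-field policy along those lines can be sketched as plain Java (the field names and the 100-term floor are illustrative, and in a real deployment this logic would live in a Similarity subclass):

```java
public class FieldNorms {
    // Plateau for "body", plain 1/sqrt for everything else (e.g. "title").
    static float lengthNorm(String fieldName, int numTerms) {
        if ("body".equals(fieldName)) {
            // Treat any body shorter than 100 terms as if it had 100,
            // so very short documents can't dominate the results.
            numTerms = Math.max(numTerms, 100);
        }
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    public static void main(String[] args) {
        System.out.println(lengthNorm("body", 10));  // ~0.1, plateaued
        System.out.println(lengthNorm("title", 10)); // ~0.316, no plateau
    }
}
```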

This stuff is corpus specific, though.

http://www.mail-archive.com/java-user@lucene.apache.org/msg08496.html

Is there any scientific reason behind this? Does anybody know a paper about this issue?

Here's one from 1997:

Lee, Chuang, and Seamons: "Document Ranking and the Vector Space Model"
http://www.cs.ust.hk/faculty/dlee/Papers/ir/ieee-sw-rank.pdf

Is there perhaps another discussion thread in here which I have not seen?

http://www.mail-archive.com/java-dev@lucene.apache.org/msg04509.html
http://www.mail-archive.com/java-dev@lucene.apache.org/msg01704.html

Searching the mail archives for "lengthNorm" will turn up some more.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
