On Mar 23, 2006, at 11:22 AM, Otis Gospodnetic wrote:

> The place to start would be to look at the DefaultSimilarity, and the norms method there. Perhaps you want to create your own Similarity implementation that returns either a constant 1 or something else that will favour longer text. Somebody else with more experience in this area may have better or more precise suggestions.

Here's an implementation of lengthNorm() that stops the weighting at 100 tokens: any field shorter than that is treated as if it were 100 tokens long.

  public float lengthNorm(String fieldName, int numTerms) {
    // Treat any field shorter than 100 tokens as if it had 100 tokens.
    numTerms = numTerms < 100 ? 100 : numTerms;
    return (float)(1.0 / Math.sqrt(numTerms));
  }
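
To make the effect concrete, here is a small standalone sketch that compares the uncapped 1/sqrt(numTerms) formula with the capped version. It has no Lucene dependency, and the class and method names (LengthNormDemo, uncappedLengthNorm, cappedLengthNorm) are made up for illustration. It only shows the arithmetic, not how the value is stored in an index.

```java
public class LengthNormDemo {
    // Uncapped norm: 1/sqrt(numTerms), so a shorter field always gets a larger value.
    static float uncappedLengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // Capped norm: fields under 100 tokens are treated as if they had 100 tokens.
    static float cappedLengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(Math.max(numTerms, 100)));
    }

    public static void main(String[] args) {
        int[] lengths = {5, 25, 100, 400};
        for (int n : lengths) {
            System.out.printf("%4d tokens: uncapped=%.4f capped=%.4f%n",
                    n, uncappedLengthNorm(n), cappedLengthNorm(n));
        }
    }
}
```

Below 100 tokens the capped value stays flat; at 100 tokens and above the two formulas agree.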

If you adopt it, you must boost short but important fields (e.g. title), or they won't contribute enough to the score, since they no longer gain anything from being short.
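
As a rough guide to how large such a boost might be, the sketch below computes the factor a short field loses under the 100-token cap, relative to the uncapped 1/sqrt(numTerms). The class and method names (TitleBoostDemo, lostFactor) are hypothetical, and the result is only a starting point for tuning, not a recommended value. In Lucene the boost itself would, I believe, be set on the field (e.g. via Field.setBoost()), but check the API for your version.

```java
public class TitleBoostDemo {
    // The cap used in the lengthNorm() implementation discussed here.
    static final int CAP = 100;

    // Ratio of the uncapped norm to the capped norm for a field of numTerms tokens.
    // Assumes numTerms >= 1. At or above the cap nothing is lost, so the ratio is 1.
    static float lostFactor(int numTerms) {
        if (numTerms >= CAP) {
            return 1.0f;
        }
        return (float) Math.sqrt((double) CAP / numTerms);
    }

    public static void main(String[] args) {
        int[] lengths = {4, 25, 100};
        for (int n : lengths) {
            System.out.printf("%3d tokens: factor lost to the cap = %.2f%n", n, lostFactor(n));
        }
    }
}
```

A 4-token title, for example, loses a factor of 5 compared with the uncapped formula, so a boost somewhere in that neighbourhood would roughly restore its old weight.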

KinoSearch (my loose Perl/C port of Lucene) uses this algorithm, and it seems to work well.

To see an earlier discussion on this subject, search the web for "proposal defaultsimilarity lengthnorm".

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

