On Mar 23, 2006, at 11:22 AM, Otis Gospodnetic wrote:
> The place to start would be to look at the DefaultSimilarity, and
> the norms method there. Perhaps you want to create your own
> Similarity implementation that returns either a constant 1 or
> something else that will favour longer text. Somebody else with
> more experience in this area may have better or more precise
> suggestions.

Here's an implementation of lengthNorm() that stops the weighting
at 100 tokens: any field shorter than 100 tokens gets the same norm
as a 100-token field.

    public float lengthNorm(String fieldName, int numTerms) {
        numTerms = numTerms < 100 ? 100 : numTerms;
        return (float)(1.0 / Math.sqrt(numTerms));
    }

If you adopt it, you must boost short but important fields (e.g.
title), or they won't contribute enough.
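
To put rough numbers on that, here is a small standalone sketch (not
Lucene code) comparing the capped norm above with a plain, uncapped
1/sqrt(numTerms). The class name, the method names, and the idea of a
"compensating boost" are assumptions made up for illustration, and the
sketch ignores how norms are actually stored in the index.

```java
// Hypothetical illustration only; none of these names come from Lucene.
public class CappedLengthNorm {
    // Fields shorter than this are treated as if they had this many terms.
    static final int MIN_TERMS = 100;

    // Plain 1/sqrt(numTerms), with no floor on the field length.
    static float uncapped(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // Same formula as the lengthNorm() above: lengths below 100 count as 100.
    static float capped(int numTerms) {
        return (float) (1.0 / Math.sqrt(Math.max(numTerms, MIN_TERMS)));
    }

    // Factor a short field loses under the cap, i.e. the boost that would
    // bring it back to its uncapped weight.
    static float compensatingBoost(int numTerms) {
        return uncapped(numTerms) / capped(numTerms);
    }

    public static void main(String[] args) {
        int[] lengths = {4, 25, 100, 400};
        for (int n : lengths) {
            System.out.printf("%4d terms: uncapped=%.3f capped=%.3f boost=%.2f%n",
                    n, uncapped(n), capped(n), compensatingBoost(n));
        }
    }
}
```

A 4-token title drops from 0.5 to 0.1 under the cap, so a boost of
around 5 on that field would restore its old weight. Fields of 100
tokens or more are unaffected. Whether you want to restore the full
factor is a tuning decision. Treat the number as a starting point.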
KinoSearch (my loose Perl/C port of Lucene) uses this algorithm, and
it seems to work well.
To see an earlier discussion on this subject perform a web search for
"proposal defaultsimilarity lengthnorm".
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/