On Jul 14, 2006, at 7:42 AM, Yonik Seeley wrote:
On 7/14/06, Rob Staveley (Tom) <[EMAIL PROTECTED]> wrote:
What would I lose by omitting norms? The ability to boost individual
fields as they are added to the index? Anything else?
Length normalization of the field. Full-text matches on shorter
fields score higher because the match is seen as more specific. You
lose that if you omit norms. That's typically OK for short fields
like "title" anyway, and for fields that aren't full-text (like dates,
numbers, etc.).
Yonik, I disagree on one point. I recommend against omitting norms
for title fields.
Without norms, the titles "Duke Ellington" and "Duke Ellington meets
Count Basie" will contribute equally to their respective document
scores on a search for "Duke Ellington". For most applications,
exact title matches should win, so that's not optimal.
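
To make the arithmetic concrete, here is a rough sketch in Python. It
assumes the classic Lucene-style length norm of 1/sqrt(number of tokens)
and ignores everything else that goes into a real score (idf, tf, boosts,
the lossy one-byte norm encoding). The function names and the whitespace
tokenizer are made up for illustration.

```python
import math

def length_norm(num_tokens: int) -> float:
    """Length normalization factor, assumed here to be 1 / sqrt(num_tokens)."""
    return 1.0 / math.sqrt(num_tokens)

def title_contribution(title: str, use_norms: bool) -> float:
    # Whitespace splitting is a stand-in for a real analyzer.
    tokens = title.lower().split()
    # With norms omitted, every field gets the same factor of 1.0.
    return length_norm(len(tokens)) if use_norms else 1.0

short_title = "Duke Ellington"
long_title = "Duke Ellington meets Count Basie"

for use_norms in (True, False):
    print(
        "norms on " if use_norms else "norms off",
        round(title_contribution(short_title, use_norms), 3),
        round(title_contribution(long_title, use_norms), 3),
    )
```

With norms, the two-token title gets a factor of about 0.707 against
about 0.447 for the five-token title, so the exact match wins. Without
norms both get 1.0 and the tie has to be broken by something else.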
KinoSearch adopted a default tf() truncation scheme where all fields
were normalized as if they contained a minimum of 100 tokens. That
achieved the desired outcome of stopping very short documents from
scoring inappropriately high, but even with a boost assigned to a
title field, I've found that I can't get really good IR precision
without going back to a non-truncating tf() for title.
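
A sketch of why the 100-token floor hurts titles, again assuming a
1/sqrt(number of tokens) norm. The exact formula KinoSearch used is not
shown here, so floored_length_norm and its min_tokens parameter are
hypothetical names for "normalize as if the field had at least 100
tokens".

```python
import math

def floored_length_norm(num_tokens: int, min_tokens: int = 100) -> float:
    """Normalize as if the field held at least min_tokens tokens (assumed form)."""
    return 1.0 / math.sqrt(max(num_tokens, min_tokens))

# Both titles fall under the floor, so they get the same factor
# and the exact title match loses its edge.
print(floored_length_norm(2), floored_length_norm(5))

# A longer body field is still penalized relative to the floor.
print(floored_length_norm(400))
```

Any field shorter than the floor ends up with an identical factor, which
is the same situation as omitting norms for that field. That is fine for
keeping very short documents from scoring too high, but it throws away
exactly the distinction that matters between short titles.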
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/