After thinking about, and discussing with Robert, how to allow Lucene
to support other scoring models (eg lnu.ltc, BM25), I think a
relatively contained set of changes can give us a solid step forward.
Something like this:
* Store additional per-doc stats in the index, eg in a custom
posting list, including the field's length in tokens, avg tf, and
boost (boost can be stored efficiently, ie only when it differs
from the default). Do not compute or store norms in the index.
Merging would just concatenate these values (dropping deleted
docs).
* Change IndexReader (IR) so that on open it generates norms
dynamically, ie by walking the stats, computing avgs (eg avg field
length in tokens), and computing the final per-field boost, cast to
a 1-byte quantized float. We may want to store aggregates in eg
SegmentInfo to save the extra pass on IR open...
* Change Similarity to allow a field-specific Similarity (I think we
have an issue open for this already). I also think lengthNorm
(which would no longer be invoked during indexing) would no longer
be used.
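To make the second step concrete, here is a rough sketch of walking
per-doc stats at open time to produce one-byte norms. Everything in it
is an assumption for illustration: the DocStats and DynamicNorms names,
the boost * avgLen / docLen formula, and the linear quantizer (a
stand-in, not Lucene's actual norm encoding).

```java
/** Hypothetical sketch: derive one-byte norms from per-doc stats at reader open. */
final class DynamicNorms {

    /** Largest norm value the stand-in quantizer can represent (assumed). */
    static final float MAX_NORM = 4f;

    /** Per-document stats for one field: length in tokens, avg tf, boost. */
    static final class DocStats {
        final int fieldLength;
        final float avgTf;
        final float boost;

        DocStats(int fieldLength, float avgTf, float boost) {
            this.fieldLength = fieldLength;
            this.avgTf = avgTf;
            this.boost = boost;
        }
    }

    /** First pass over the stats: average field length in tokens. */
    static float averageFieldLength(DocStats[] stats) {
        if (stats.length == 0) {
            return 0f;
        }
        long total = 0;
        for (DocStats s : stats) {
            total += s.fieldLength;
        }
        return (float) total / stats.length;
    }

    /** Stand-in quantizer: clamp to [0, MAX_NORM], map linearly onto 0..255. */
    static byte quantize(float value) {
        float clamped = Math.max(0f, Math.min(MAX_NORM, value));
        return (byte) Math.round(clamped / MAX_NORM * 255f);
    }

    /** Second pass: one quantized norm per document. */
    static byte[] computeNorms(DocStats[] stats) {
        float avgLen = averageFieldLength(stats);
        byte[] norms = new byte[stats.length];
        for (int i = 0; i < stats.length; i++) {
            DocStats s = stats[i];
            // Assumed formula: docs shorter than average get a larger norm.
            float lengthFactor = s.fieldLength == 0 ? 0f : avgLen / s.fieldLength;
            norms[i] = quantize(s.boost * lengthFactor);
        }
        return norms;
    }
}
```

With this formula a doc of average length and default boost gets 1.0
before quantization. The stored bytes are unsigned, so read them back
with (norm & 0xFF). Keeping the total length per segment (eg in
SegmentInfo) would let averageFieldLength skip its pass.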
I think we'd make the class that computes norms from these per-doc
stats on IR open pluggable. Someday we could also make pluggable which
stats are gathered/stored during indexing, but for starters I think we
should simply support the field length in tokens and avg tf per field.
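As a sketch of what "pluggable" could look like: a small interface
that maps one doc's stats plus the collection average to an
unquantized norm, with two sample implementations. The NormFunction
and PluggableNorms names are hypothetical, and quantization to one
byte is left out to keep it short. SQRT_LENGTH follows the familiar
boost / sqrt(length) shape; bm25Length computes only BM25's length
component, 1 - b + b * len / avgLen, and ignores boost.

```java
/** Hypothetical plug-in point: one doc's stats in, unquantized norm out. */
interface NormFunction {
    float norm(int fieldLength, float avgTf, float boost, float avgFieldLength);
}

final class PluggableNorms {

    /** boost / sqrt(length); needs no collection-wide averages. */
    static final NormFunction SQRT_LENGTH = (len, avgTf, boost, avgLen) ->
        len == 0 ? 0f : boost / (float) Math.sqrt(len);

    /** BM25-style length component with tunable b in [0, 1]. */
    static NormFunction bm25Length(float b) {
        return (len, avgTf, boost, avgLen) ->
            avgLen == 0f ? 1f : 1f - b + b * (len / avgLen);
    }

    /** Walks the per-doc stats once for the average, once to apply fn. */
    static float[] compute(NormFunction fn, int[] fieldLengths,
                           float[] avgTfs, float[] boosts) {
        long total = 0;
        for (int len : fieldLengths) {
            total += len;
        }
        float avgLen = fieldLengths.length == 0
            ? 0f : (float) total / fieldLengths.length;

        float[] out = new float[fieldLengths.length];
        for (int i = 0; i < out.length; i++) {
            out[i] = fn.norm(fieldLengths[i], avgTfs[i], boosts[i], avgLen);
        }
        return out;
    }
}
```

Swapping scoring models then means passing a different NormFunction to
compute; the stored stats stay the same.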
Thoughts?
Mike