I cannot comment on the "marked-as-deleted" documents, but the approach
I outlined might indeed impact the scores. I prefer to say 'impact'
instead of 'skew', because to me 'skew' would imply that the original
scores are some kind of ideal state which then gets distorted. I don't
think this is necessarily the case with term weight shifts.

It really depends on the specific setup. If there are millions of
documents in the index, and some of them count ten times and others a
hundred times towards the collection statistics (as contributions to
the figures, not as physical duplicates), I don't think this would lead
to a significant change overall. With a large index, I would be
surprised if this affected precision by anything as drastic as, say, 5%.
And if marginal shifts are troublesome, you can always maintain two
indexes: one with all the document versions for reference if required
and the other one with only the current documents for everyday searches.
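
To get a feeling for the size of such a shift, here is a rough sketch
in plain Java. It assumes the idf formula used by Lucene's
DefaultSimilarity, ln(numDocs / (docFreq + 1)) + 1; the class name, the
helper methods and all the numbers are made up for illustration and are
not taken from any real index.

```java
final class IdfShiftSketch {

    // idf as in Lucene's DefaultSimilarity (assumed here):
    // ln(numDocs / (docFreq + 1)) + 1
    static double idf(long docFreq, long numDocs) {
        return Math.log((double) numDocs / (double) (docFreq + 1)) + 1.0;
    }

    // Relative change in idf when extra document versions inflate both
    // the document frequency of a term and the total document count.
    static double relativeShift(long docFreq, long numDocs,
                                long extraWithTerm, long extraTotal) {
        double clean = idf(docFreq, numDocs);
        double inflated = idf(docFreq + extraWithTerm, numDocs + extraTotal);
        return (inflated - clean) / clean;
    }

    public static void main(String[] args) {
        // Hypothetical figures: 5 million current documents, 20,000 of
        // which contain the term; 50,000 extra versions, 2,000 of which
        // contain the term.
        long numDocs = 5_000_000L;
        long docFreq = 20_000L;
        long extraTotal = 50_000L;
        long extraWithTerm = 2_000L;

        double shift = relativeShift(docFreq, numDocs, extraWithTerm, extraTotal);
        System.out.printf("idf without extra versions: %.4f%n", idf(docFreq, numDocs));
        System.out.printf("idf with extra versions:    %.4f%n",
                idf(docFreq + extraWithTerm, numDocs + extraTotal));
        System.out.printf("relative shift:             %.2f%%%n", shift * 100.0);
    }
}
```

This only looks at the idf of a single term, not at the final score or
at precision, so take it as an order-of-magnitude check. Plugging in
the counts from your own index will tell you more than my invented ones.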
Cheers
Rene
On 16.03.2010 14:05, TCK wrote:
Wouldn't these excluded/filtered documents skew the scores even though they
are supposed to be marked as deleted? Don't the idf values used in scoring
depend on the entire document set and not just the matching hits for a
query?
Thanks,
TCK
On Tue, Mar 16, 2010 at 5:45 AM, Rene Hackl-Sommer <rene.a.ha...@gmx.de> wrote: