I cannot comment on the "marked-as-deleted" documents, but for the approach I outlined: yes, this might impact the scores. I prefer to say 'impact' rather than 'skew', because to me 'skew' implies that the original scores are some kind of ideal state that is being distorted. I don't think that is necessarily the case with term weight shifts.

It really depends on the specific setup. Suppose there are millions of documents in the index, and some of them count ten times and others a hundred times towards the statistical figures (in terms of their contribution, not as real physical multiple instances). I don't think this would lead to a significant change overall. With a large index, I would be surprised if this affected precision drastically, say by 5%.
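
To get a feeling for the size of such a shift, here is a rough sketch. It assumes the classic idf formula of Lucene's DefaultSimilarity, idf = 1 + ln(numDocs / (docFreq + 1)). The class name, the helper methods and all the numbers are made up for illustration, and your Similarity may compute idf differently.

```java
public class IdfShiftSketch {

    // Assumed formula (classic DefaultSimilarity style):
    // idf = 1 + ln(numDocs / (docFreq + 1))
    static double idf(long docFreq, long numDocs) {
        return 1.0 + Math.log((double) numDocs / (double) (docFreq + 1));
    }

    // Relative change of the idf value, in percent of the "before" value.
    static double relativeShiftPercent(double before, double after) {
        return 100.0 * (after - before) / before;
    }

    public static void main(String[] args) {
        // Hypothetical figures: only the current documents are counted.
        long currentDocs = 1_000_000L;
        long currentDocFreq = 1_000L;

        // Hypothetical figures: old versions also contribute to the statistics,
        // and the term happens to be over-represented among them.
        long allDocs = 1_500_000L;
        long allDocFreq = 2_000L;

        double before = idf(currentDocFreq, currentDocs);
        double after = idf(allDocFreq, allDocs);

        System.out.printf("idf (current only): %.4f%n", before);
        System.out.printf("idf (all versions): %.4f%n", after);
        System.out.printf("relative shift:     %.2f%%%n",
                relativeShiftPercent(before, after));
    }
}
```

Note that if the extra versions raise numDocs and docFreq by roughly the same factor, the ratio inside the logarithm hardly moves, so the idf stays nearly the same. A visible shift only shows up for terms that are clearly over- or under-represented among the additional versions.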

And if marginal shifts are troublesome, you can always maintain two indexes: one with all the document versions for reference if required and the other one with only the current documents for everyday searches.

Cheers
Rene

On 16.03.2010 14:05, TCK wrote:
Wouldn't these excluded/filtered documents skew the scores even though they
are supposed to be marked as deleted? Don't the idf values used in scoring
depend on the entire document set and not just the matching hits for a
query?

Thanks,
TCK




On Tue, Mar 16, 2010 at 5:45 AM, Rene Hackl-Sommer <rene.a.ha...@gmx.de> wrote:


