Chris Hostetter <[EMAIL PROTECTED]> wrote on 12/04/2007 15:22:20:

>
> : But not which terms have an odd IDF value because of those deleted
> : documents.  How much does the IDF value contribute to the "score" in
> : search?
>
> all IDFs are affected equally, because the "numDocs" value used is
> always the same ... it really shouldn't affect the scores from a query,
> it just makes it hard to compare the scores you get from one index reader
> with the scores you get from a new index reader after deleting and
> readding a bunch of documents.

Not sure about the extreme case - assume two words, one common and one
rare, making up an OR query:
   TC  (a very common term)
   TR  (a very rare term)
   IDF(TC) << IDF(TR)
   Q = TC TR
   D1 = document with one occurrence of TR
   D2 = document with three occurrences of TC.
==> D1 is scored higher than D2

But suppose the index now goes through a massive update, where almost all
the docs containing TC are deleted and TC is not in any newly added doc.
In practice TC becomes rare too, and hence D2 should probably be scored
higher than D1. But IDF(TC) might not (yet) reflect the massive deletion
of docs, so the scores are wrongly biased and D1 is still scored higher
than D2.
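
To put rough numbers on this, here is a small sketch in Python (not
Lucene code). It assumes the classic formulas idf = 1 + ln(numDocs /
(docFreq + 1)) and tf = sqrt(freq), scores a single-term match as
tf * idf^2, and ignores length norms, query norm and coord. All the
counts (one million docs, the docFreq values) are made up for
illustration, and numDocs is held constant to keep the comparison simple.

```python
import math


def idf(doc_freq, num_docs):
    # Classic-style IDF: a rarer term (smaller doc_freq) gets a larger value.
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def term_score(term_freq, doc_freq, num_docs):
    # Simplified single-term score: tf * idf^2 (norms and coord ignored).
    return math.sqrt(term_freq) * idf(doc_freq, num_docs) ** 2


# Hypothetical index statistics, chosen only for illustration.
NUM_DOCS = 1_000_000
DF_TR = 10             # TR: a very rare term
DF_TC_STALE = 500_000  # TC as still counted, deleted docs included
DF_TC_FRESH = 5        # TC once the deletions are reflected


def scores(df_tc):
    d1 = term_score(1, DF_TR, NUM_DOCS)  # D1: one occurrence of TR
    d2 = term_score(3, df_tc, NUM_DOCS)  # D2: three occurrences of TC
    return d1, d2


stale_d1, stale_d2 = scores(DF_TC_STALE)
fresh_d1, fresh_d2 = scores(DF_TC_FRESH)

print("stale stats: D1=%.1f D2=%.1f" % (stale_d1, stale_d2))
print("fresh stats: D1=%.1f D2=%.1f" % (fresh_d1, fresh_d2))
```

With these made-up numbers D1 outranks D2 under the stale docFreq, and
the order flips once docFreq(TC) reflects the deletions, which is the
bias described above.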

I didn't follow the code for that, I'm just thinking aloud about IDFs and
scoring, so I hope I am not missing something. In any case this is just for
the sake of discussion, because in reality you don't expect index statistics
to change that dramatically ahead of merges.


