RE: Normalized Scoring -- was RE: idf and explain(), was Re: Search and Scoring

Chuck Williams Thu, 21 Oct 2004 12:55:11 -0700

Daniel,

I haven't yet dealt with multiple indices, but will in the not-too-distant future, so 
this sounds like a problem that will also be important to me.  I just briefly read 
through the relevant code (e.g., MultiSearcher) to try to understand the issue.  My 
guess is the problem arises from the fact that the separate indices have separately 
computed their tf's and idf's.  This would imply that the searches against each index 
are completely separate searches.  Since the current scoring does not produce scores 
that are comparable across separate searches, the re-sorting of the hits in 
MultiSearcher.search() via the HitQueue would not accomplish its intended effect.  
This would lead to an incorrect final ranking.  Is that the problem you are actually 
seeing?  If I've got it right, then yes, I believe what I'm proposing will fix this 
too since it would make the scores coming back from the searches against the separate 
indices directly comparable, causing the interleaving in MultiSearcher.search() to 
work properly.
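
To make the concern concrete, here is a toy sketch in Python rather than Lucene code.  The numbers, the document ids, and the simplified score (sqrt(tf) * idf, with no norms, boosts or queryNorm) are all my own assumptions for illustration; the idf formula is the one I believe DefaultSimilarity uses.  Each index scores its hit with its own idf, and the hits are then merged by raw score as I understand MultiSearcher to do:

```python
import math

def idf(doc_freq, num_docs):
    # Formula I believe DefaultSimilarity uses: 1 + ln(numDocs / (docFreq + 1))
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def local_score(tf, doc_freq, num_docs):
    # Simplified assumption: sqrt(tf) * idf, ignoring norms, boosts and queryNorm
    return math.sqrt(tf) * idf(doc_freq, num_docs)

# Hypothetical statistics: the query term is common in index A, rare in index B
index_a = {"num_docs": 1000, "doc_freq": 500}
index_b = {"num_docs": 1000, "doc_freq": 5}

# Hypothetical hits: a1 contains the term 9 times, b1 only once
hits = [
    ("a1", local_score(9, **index_a)),
    ("b1", local_score(1, **index_b)),
]

# Merge by raw score, highest first, with no adjustment between indices
merged = sorted(hits, key=lambda hit: hit[1], reverse=True)
print([doc for doc, _ in merged])
```

With these made-up numbers b1 comes out ahead of a1 purely because the term happens to be rare in index B, which is the kind of incorrect interleaving I am guessing at.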

However, I'm not sure this analysis is completely correct, because of 
MultiSearcher.docFreq(), which appears to be trying to redefine the document 
frequencies (and hence the idf's) to be the global values across all indices.  It 
wasn't clear to me how this code is ever reached, e.g. from TermQuery --> 
SegmentTermDocs.  If the tf's and idf's are in fact computed globally, then the 
interleaving should work as it is, thus I'm guessing they are not.
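
For comparison, this is roughly what I would expect a global computation to look like, again as a Python sketch and not the actual Lucene code.  The IndexStats and MultiStats classes, the term and the counts are hypothetical; the idea of summing docFreq and maxDoc over the sub-indices is my reading of what MultiSearcher.docFreq() is attempting:

```python
import math

class IndexStats:
    """Hypothetical stand-in for the statistics of one index."""
    def __init__(self, max_doc, doc_freqs):
        self._max_doc = max_doc
        self._doc_freqs = doc_freqs  # term -> number of docs containing it

    def doc_freq(self, term):
        return self._doc_freqs.get(term, 0)

    def max_doc(self):
        return self._max_doc

class MultiStats:
    """Hypothetical aggregate: sums the statistics of all sub-indices."""
    def __init__(self, subs):
        self._subs = subs

    def doc_freq(self, term):
        return sum(sub.doc_freq(term) for sub in self._subs)

    def max_doc(self):
        return sum(sub.max_doc() for sub in self._subs)

def idf(doc_freq, num_docs):
    # Formula I believe DefaultSimilarity uses: 1 + ln(numDocs / (docFreq + 1))
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# Made-up counts: the term is common in the first index, rare in the second
multi = MultiStats([
    IndexStats(1000, {"lucene": 500}),
    IndexStats(1000, {"lucene": 5}),
])

# One idf shared by every hit, whichever index it came from
global_idf = idf(multi.doc_freq("lucene"), multi.max_doc())
print(multi.doc_freq("lucene"), multi.max_doc(), global_idf)
```

If every sub-search used this one shared idf, the scores would differ only through per-document factors such as tf, and merging by score would be meaningful.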

This raises the question of the desired semantics.  Computing the tf's and idf's 
globally seems right for apps that use multiple indices strictly for scalability 
reasons, while issuing separate searches with properly-comparable but separate scoring 
on each seems right for meta-search.  If the scalability case isn't working right 
(i.e., if MultiSearcher is not computing the tf's and idf's across the entire 
collection of indices), fixing it would require a different approach than what I've 
proposed.

If I've missed the actual problem entirely, please let me know.

Thanks,

Chuck

  > -----Original Message-----
  > From: Daniel Naber [mailto:[EMAIL PROTECTED]
  > Sent: Thursday, October 21, 2004 11:33 AM
  > To: Lucene Developers List
  > Subject: Re: Normalized Scoring -- was RE: idf and explain(), was Re:
  > Search and Scoring
  > 
  > On Thursday 21 October 2004 20:00, Chuck Williams wrote:
  > 
  > > Thanks Otis.  Other than trying to get some consensus a) that this is a
  > > problem worth fixing, and b) on the best approach to fix it, my central
  > > question is, if I fix it is it likely to get incorporated back into
  > > Lucene?
  > 
  > Chuck,
  > 
  > sorry, I also lack the time and knowledge to follow this discussion, but
  > what I consider a problem is that you currently cannot search over several
  > indices without getting an incorrect ranking (except these indices were
  > built from splitting one large index). Is that also something you're
  > trying to solve?
  > 
  > Regards
  >  Daniel
  > 
  > --
  > http://www.danielnaber.de
  > 
  > ---------------------------------------------------------------------
  > To unsubscribe, e-mail: [EMAIL PROTECTED]
  > For additional commands, e-mail: [EMAIL PROTECTED]

