I agree that this bug is important to fix, but I don't believe we have a solid fix yet. Getting idf-normalization right is essential for large distributed-index applications. I have a client evaluating Lucene for this now. Like Wolf, I hope a committer with deep knowledge of Lucene's design in this area will weigh in on the issue and help to resolve it.

I've read through Wolf's patch and see a few issues (please correct anything wrong here):

1. DfMapSimilarity works only with a limited set of queries. A complete solution should support all Query types, and certainly must support fundamental Query types like RangeQuery. Could this be addressed by using primitive queries rather than surface queries (i.e., after rewriting)? There may be a more fundamental issue for Query's that generate large numbers of clauses, because it is very inefficient to contact all the RemoteSearchable's for each Term.

2. The patch hardwires the use of DfMapSimilarity into MultiSearcher. As Wolf points out in his comments, this needs to be configurable. At present, it would be impossible to use a custom Similarity, e.g. to change the numerical computation of idf() from the docFreq. The ability to configure custom Similarity's needs to be robust in the presence of MultiSearcher, i.e. an application should be able to make the kinds of changes currently made in a subclass of DefaultSimilarity while inheriting the behavior that makes it work properly with MultiSearcher.

3. Philosophically, I'm not convinced that Similarity's are the right solution. Similarity's are currently used for application-specific scoring customizations. The issue here is idf-normalization in the presence of multiple searchers, which should be an orthogonal consideration.

My patch with a topmostSearcher field also has issues, especially the fatal problem that it doesn't work for RemoteSearchable's.

A burning question for me is: what is the right solution for RemoteSearchable's? With Wolf's patch, the MultiSearcher analyzes each Query to identify the terms it uses, calls each RemoteSearchable to get the docFreq's from its index, sums them, extends the Query with a Map of these sums (within a created Similarity), and then passes this information back to the RemoteSearchable's to use during their scoring.
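
For illustration only, the aggregation step described above (sum each term's docFreq across all indexes, then score against the sums) can be sketched in plain Java. Nothing here is Lucene API or taken from either patch: the class GlobalDfSketch and the methods sumDocFreqs and idf are hypothetical names, and the idf formula is assumed to mirror what DefaultSimilarity computes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Lucene code: shows why scoring needs docFreq's
// summed over every index rather than the local docFreq of one index.
public class GlobalDfSketch {

    /** Sums per-index document frequencies into one map (term -> global docFreq). */
    public static Map<String, Integer> sumDocFreqs(List<Map<String, Integer>> perIndexDocFreqs) {
        Map<String, Integer> global = new HashMap<>();
        for (Map<String, Integer> local : perIndexDocFreqs) {
            for (Map.Entry<String, Integer> e : local.entrySet()) {
                global.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        return global;
    }

    /** Assumed idf formula: log(numDocs / (docFreq + 1)) + 1. */
    public static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    public static void main(String[] args) {
        // Two made-up indexes with different sizes and term distributions.
        Map<String, Integer> indexA = new HashMap<>();
        indexA.put("lucene", 3);
        Map<String, Integer> indexB = new HashMap<>();
        indexB.put("lucene", 500);
        int numDocsA = 100;
        int numDocsB = 900;

        List<Map<String, Integer>> all = new ArrayList<>();
        all.add(indexA);
        all.add(indexB);
        Map<String, Integer> global = sumDocFreqs(all);

        // The same term gets a different idf in each index when scored locally,
        // so scores from the two indexes are not comparable.
        System.out.println("local idf in A: " + idf(indexA.get("lucene"), numDocsA));
        System.out.println("local idf in B: " + idf(indexB.get("lucene"), numDocsB));
        // With summed statistics, every index would use the same idf.
        System.out.println("global idf:     " + idf(global.get("lucene"), numDocsA + numDocsB));
    }
}
```

In this sketch the summed map is built once and could be shared by every index, whether it is computed per Query or ahead of time; how it would reach a RemoteSearchable is left open.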

An alternative approach would be to precompute the docFreq sums and distribute them to all the RemoteSearchable's ahead of time, independent of Query's. Incremental indexing would need to recompute and propagate the revised sums. Having the sums pre-distributed would make Query-processing efficient. Is something along those lines possible?

Chuck

> -----Original Message-----
> From: Wolf Siberski [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, January 11, 2005 12:55 AM
> To: Lucene Developers List
> Subject: How to proceed with Bug 31841 - MultiSearcher problems with
> Similarity.docFreq() ?
>
> As I'm very interested in resolving this bug,
> I would like to resume the discussion about it.
> Chuck Williams (the original bug reporter) and me
> both already have provided a patch. Is any of the
> committers willing to review them?
> If changes are necessary, or another way of handling
> this issue turns out to be more appropriate, I would
> gladly put more work into that area.
> But I need the support of (at least) one committer, and
> also IMHO some additional discussion about how to tackle
> that issue wouldn't hurt, too.
>
> --Wolf
>
>
> [EMAIL PROTECTED] wrote:
> > DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG*
> > RELATED COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT
> > <http://issues.apache.org/bugzilla/show_bug.cgi?id=31841>.
> > ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND*
> > INSERTED IN THE BUG DATABASE.
> >
> > http://issues.apache.org/bugzilla/show_bug.cgi?id=31841
> >
> >
> > [EMAIL PROTECTED] changed:
> >
> >            What    |Removed                     |Added
> > ----------------------------------------------------------------------------
> >                  CC|                            |[EMAIL PROTECTED]
> >
> >
> >
> >
> > ------- Additional Comments From [EMAIL PROTECTED]  2005-01-04 23:49 -------
> > *** Bug 32053 has been marked as a duplicate of this bug. ***
> >
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]