Paul Elschot wrote:
> An alternative is to make sure all scores are bounded.
> Then the coordination factor can be implemented in the same bound
> while preserving the coordination order.
If I understand this, I think more is required. My normalization proposal from a couple months ago involved a boost-weighted term-coverage normalization of the raw scores (i.e., based on coord's that are boost-weighted). Raw scores would be bounded in [0.0, 1.0], unlike now where they are unbounded. But one also needs a way to recover, from just the score, critical quality information such as whether or not all terms were matched. I was hoping to do this by simple thresholding, e.g. achieve a property like "results with all terms matched are always in [0.8, 1.0], and results missing a term always have a score less than 0.8". I'm not certain whether that property can be obtained, but feel confident that this would yield a pretty good absolute quality measure in any event.

Chuck

> -----Original Message-----
> From: Paul Elschot [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, February 01, 2005 2:20 PM
> To: lucene-dev@jakarta.apache.org
> Subject: Re: URL to compare 2 Similarity's ready-- Re: Scoring benchmark
> evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher
> problems with Similarity.docFreq() ?
>
> Doug,
>
> On Tuesday 01 February 2005 20:05, Doug Cutting wrote:
> > Chuck Williams wrote:
> > > > So I think this can be implemented using the expansion I proposed
> > > > yesterday for MultiFieldQueryParser, plus something like my
> > > > DensityPhraseQuery and perhaps a few Similarity tweaks.
> > >
> > > I don't think that works unless the mechanism is limited to default-AND
> > > (i.e., all clauses required).
> >
> > Right. I have repeatedly argued for default-AND.
> >
> > > However, I don't see a way to integrate term proximity into that
> > > expansion. Specifically, I don't see a way to handle proximity and
> > > coverage simultaneously without managing the multiple fields, field
> > > boosts and proximity considerations in a single query class. Whence,
> > > the proposal for such a class.
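[As an unquoted aside, a minimal sketch of the thresholding idea above. This is hypothetical illustration code, not part of Lucene: the band boundary (0.8) and the assumption that both the raw score and the boost-weighted coverage are already normalized into [0.0, 1.0] are mine.]

```java
// Hypothetical sketch (not Lucene code): band a bounded raw score by
// coverage so full-coverage matches always outrank partial ones.
public class BandedScore {
    /**
     * rawScore: already normalized into [0.0, 1.0].
     * coverage: boost-weighted fraction of query terms matched, in [0.0, 1.0].
     * Full-coverage results land in [0.8, 1.0]; all others land in [0.0, 0.8).
     */
    public static double band(double rawScore, double coverage) {
        if (coverage >= 1.0) {
            return 0.8 + 0.2 * rawScore;   // all terms matched: [0.8, 1.0]
        }
        return 0.8 * coverage * rawScore;  // missing a term: strictly below 0.8
    }

    public static void main(String[] args) {
        System.out.println(band(0.5, 1.0)); // full coverage, mid raw score
        System.out.println(band(0.9, 0.5)); // high raw score, half coverage
    }
}
```

Whether this particular banding preserves a sensible ordering among partial matches is exactly the open question; the point is only that the threshold property itself is easy to enforce once raw scores are bounded.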
> >
> > To repeat my three-term, two-field example:
> >
> > +(f1:t1^b1 f2:t1^b2)
> > +(f1:t2^b1 f2:t2^b2)
> > +(f1:t3^b1 f2:t3^b2)
> > f1:"t1 t2 t3"~s1^b3
> > f2:"t1 t2 t3"~s2^b4
>
> Glad to see some more structure in queries.
>
> > Coverage is handled by the first three clauses. Each term must match in
> > at least one field. Proximity is boosted by the last two clauses: when
> > terms occur close together, the score is increased. The implementation
> > of the ~ operator could be improved, as I proposed.
> >
> > > Do you see a way to do that? I.e., do you see a scalable expansion that
> > > addresses all the issues for both default-or and default-and?
> >
> > I am not really very interested in default-OR. I think there are good
> > reasons that folks have gravitated towards default-AND. I would prefer
> > we focus on a good default-AND solution for now.
> >
> > If one wishes to rank things by coordination first, and then by score,
> > as an improved default-OR, then one needs more than just score-based
> > ranking. Trying to concoct scores that alone guarantee such a ranking
> > is very fragile. In general, one would need a HitCollector API that
> > takes both the coord and the score. This is possible, but I'm not in a
> > hurry to implement it.
>
> An alternative is to make sure all scores are bounded.
> Then the coordination factor can be implemented in the same bound
> while preserving the coordination order.
>
> > Lucene's development is constrained. We want to improve Lucene, to
> > make search results better, to make it faster, and add needed features,
> > but we must at the same time keep it back-compatible, maintainable and
> > easy-to-use. The smaller the code, the easier it is to maintain and
> > understand, so, e.g., a change that adds a lot of new code is harder to
> > accept than one that just tweaks existing code a bit.
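[Unquoted aside: the expansion in the three-term, two-field example above generalizes mechanically to any number of terms and fields -- one required per-term disjunction plus one optional sloppy phrase per field. A small sketch that generates the query syntax; illustrative only, where the b and s values are the symbolic boosts and slops from the example, not real numbers.]

```java
// Sketch: generate Doug's expansion as query-parser syntax for arbitrary
// terms and fields. This emits query *syntax*; it is not a Lucene API call.
public class ExpansionSketch {
    public static String expand(String[] terms, String[] fields,
                                String[] termBoosts, String[] phraseBoosts,
                                String[] slops) {
        StringBuilder sb = new StringBuilder();
        // One required clause per term: the term may match in any field.
        for (String t : terms) {
            sb.append("+(");
            for (int f = 0; f < fields.length; f++) {
                if (f > 0) sb.append(' ');
                sb.append(fields[f]).append(':').append(t)
                  .append('^').append(termBoosts[f]);
            }
            sb.append(") ");
        }
        // One optional sloppy phrase per field to reward proximity.
        for (int f = 0; f < fields.length; f++) {
            sb.append(fields[f]).append(":\"").append(String.join(" ", terms))
              .append("\"~").append(slops[f]).append('^').append(phraseBoosts[f]);
            if (f < fields.length - 1) sb.append(' ');
        }
        return sb.toString();
    }
}
```

The expansion grows as O(terms × fields) clauses plus O(fields) phrases, which is the scalability point at issue.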
> > We are changing
> > many APIs for Lucene 2.0, but we're also providing a clear migration
> > path for Lucene 1.X users. When we add a new, improved API we must
> > deprecate the API it replaces and make sure that the new API supports
> > all the features of the old API. We cannot afford to maintain multiple
> > implementations of similar functionality. So, for these reasons, I am
> > not comfortable simply committing your DistributingMultiFieldQueryParser
> > and MaxDisjunctionQuery. We need to fit these into Lucene, figure out
> > what they replace, etc. Otherwise Lucene could just become a
> > hodge-podge of poorly maintained classes. If we think these or
> > something like them do a better job, then we'd like it to be natural for
> > folks upgrading to start using them in favor of old methods, so that,
> > long term, we don't have to maintain both. So the problem is not simply
> > figuring out what a better default ranking algorithm is, it is also
> > figuring out how to successfully integrate such an algorithm into Lucene.
>
> MaxDisjunctionQuery could be made to fit with the new DisjunctionSumScorer.
> It's a bit of work, but straightforward.
>
> For the DistributingMultiFieldQueryParser you already gave a dissection,
> iirc.
>
> > > I think the query class I've proposed does that, and should be no more
> > > complex than the current SpanQuery mechanism, for example.
> >
> > The SpanQuery mechanism is quite complex and permits matching of a
> > completely different sort: fragments rather than whole documents. What
> > you're proposing does not seem so radically different that it cannot be
> > part of the normal document-matching mechanism.
> >
> > > Also, it should be
> > > more efficient than a nested construction of more primitive components
> > > since it can be directly optimized.
> >
> > It might use a bit less CPU, but would not reduce i/o.
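[Unquoted aside on MaxDisjunctionQuery, mentioned above: as I understand its intent, the per-document combination over field scores can be sketched roughly as below. This is a standalone illustration under my assumptions -- the tieBreaker name and the exact formula are mine, not Lucene code.]

```java
// Sketch of max-disjunction combining: take the best field's score, plus a
// small tie-breaker contribution from the remaining fields, so documents
// matching in several fields edge out single-field matches without the
// additive inflation of a plain sum. Assumes non-negative scores.
public class MaxDisjunctionSketch {
    public static float score(float[] fieldScores, float tieBreaker) {
        float max = 0f, sum = 0f;
        for (float s : fieldScores) {
            sum += s;
            if (s > max) max = s;
        }
        return max + tieBreaker * (sum - max);
    }
}
```

With tieBreaker = 0 this is a pure max; with tieBreaker = 1 it degenerates to a plain sum, which is why fitting it alongside DisjunctionSumScorer is plausible.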
> > My proposal
> > processes TermDocs twice, but since Lucene processes query terms in
> > parallel, and with filesystem caching, no extra i/o will be performed.
>
> Even the double TermDocs processing might be fixed later by a special
> purpose scorer.
>
> Regards,
> Paul Elschot
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]