Doug,

On Tuesday 01 February 2005 20:05, Doug Cutting wrote:
> Chuck Williams wrote:
> > > So I think this can be implemented using the expansion I proposed
> > > yesterday for MultiFieldQueryParser, plus something like my
> > > DensityPhraseQuery and perhaps a few Similarity tweaks.
> >
> > I don't think that works unless the mechanism is limited to default-AND
> > (i.e., all clauses required).
>
> Right. I have repeatedly argued for default-AND.
>
> > However, I don't see a way to integrate term proximity into that
> > expansion. Specifically, I don't see a way to handle proximity and
> > coverage simultaneously without managing the multiple fields, field
> > boosts and proximity considerations in a single query class. Whence,
> > the proposal for such a class.
>
> To repeat my three-term, two-field example:
>
>   +(f1:t1^b1 f2:t1^b2)
>   +(f1:t2^b1 f2:t2^b2)
>   +(f1:t3^b1 f2:t3^b2)
>   f1:"t1 t2 t3"~s1^b3
>   f2:"t1 t2 t3"~s2^b4

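
For illustration, here is a minimal sketch of how the expansion quoted above could be generated for any number of terms and fields. It only builds the query string and does not call any Lucene API. The class name MultiFieldExpansion, the method expand and its parameters are hypothetical names chosen for this sketch, and the boosts and slops are passed as strings so that the symbolic example (b1, s1, ...) can be reproduced literally.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: builds the expanded query string only.
class MultiFieldExpansion {

    /**
     * termBoosts[i] is the boost for fields[i] in the required per-term clauses;
     * slops[i] and phraseBoosts[i] belong to the optional phrase clause on fields[i].
     */
    static String expand(String[] terms, String[] fields, String[] termBoosts,
                         String[] slops, String[] phraseBoosts) {
        List<String> clauses = new ArrayList<>();

        // Coverage: one required clause per term, matching in at least one field.
        for (String term : terms) {
            List<String> perField = new ArrayList<>();
            for (int i = 0; i < fields.length; i++) {
                perField.add(fields[i] + ":" + term + "^" + termBoosts[i]);
            }
            clauses.add("+(" + String.join(" ", perField) + ")");
        }

        // Proximity: one optional sloppy phrase clause per field.
        String phrase = String.join(" ", terms);
        for (int i = 0; i < fields.length; i++) {
            clauses.add(fields[i] + ":\"" + phrase + "\"~" + slops[i]
                    + "^" + phraseBoosts[i]);
        }
        return String.join("\n", clauses);
    }

    public static void main(String[] args) {
        // Prints the five clauses of the three-term, two-field example.
        System.out.println(expand(
                new String[] {"t1", "t2", "t3"},
                new String[] {"f1", "f2"},
                new String[] {"b1", "b2"},
                new String[] {"s1", "s2"},
                new String[] {"b3", "b4"}));
    }
}
```

The number of clauses grows as (terms + 1) x fields in the worst case, counting each field alternative separately, so the expansion stays linear in both the number of terms and the number of fields.
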
Glad to see some more structure in queries.

> Coverage is handled by the first three clauses. Each term must match in
> at least one field. Proximity is boosted by the last two clauses: when
> terms occur close together, the score is increased. The implementation
> of the ~ operator could be improved, as I proposed.
>
> > Do you see a way to do that? I.e., do you see a scalable expansion that
> > addresses all the issues for both default-or and default-and?
>
> I am not really very interested in default-OR. I think there are good
> reasons that folks have gravitated towards default-AND. I would prefer
> we focus on a good default-AND solution for now.
>
> If one wishes to rank things by coordination first, and then by score,
> as an improved default-OR, then one needs more than just score-based
> ranking. Trying to concoct scores that alone guarantee such a ranking
> is very fragile. In general, one would need a HitCollector API that
> takes both the coord and the score. This is possible, but I'm not in a
> hurry to implement it.

An alternative is to make sure all scores are bounded. Then the
coordination factor can be implemented in the same bound while
preserving the coordination order.

> Lucene's development is constrained. We want to improve Lucene, to
> make search results better, to make it faster, and add needed features,
> but we must at the same time keep it back-compatible, maintainable and
> easy-to-use. The smaller the code, the easier it is to maintain and
> understand, so, e.g., a change that adds a lot of new code is harder to
> accept than one that just tweaks existing code a bit. We are changing
> many APIs for Lucene 2.0, but we're also providing a clear migration
> path for Lucene 1.X users. When we add a new, improved API we must
> deprecate the API it replaces and make sure that the new API supports
> all the features of the old API. We cannot afford to maintain multiple
> implementations of similar functionality.
> So, for these reasons, I am not comfortable simply committing your
> DistributingMultiFieldQueryParser and MaxDisjunctionQuery. We need to
> fit these into Lucene, figure out what they replace, etc. Otherwise
> Lucene could just become a hodge-podge of poorly maintained classes.
> If we think these or something like them do a better job, then we'd
> like it to be natural for folks upgrading to start using them in favor
> of old methods, so that, long term, we don't have to maintain both.
> So the problem is not simply figuring out what a better default ranking
> algorithm is, it is also figuring out how to successfully integrate
> such an algorithm into Lucene.

MaxDisjunctionQuery could be made to fit with the new
DisjunctionSumScorer. It's a bit of work, but straightforward.
For the DistributingMultiFieldQueryParser you already gave a
dissection, iirc.

> > I think
> > the query class I've proposed does that, and should be no more complex
> > than the current SpanQuery mechanism, for example.
>
> The SpanQuery mechanism is quite complex and permits matching of a
> completely different sort: fragments rather than whole documents. What
> you're proposing does not seem so radically different that it cannot be
> part of the normal document-matching mechanism.
>
> > Also, it should be
> > more efficient than a nested construction of more primitive components
> > since it can be directly optimized.
>
> It might use a bit less CPU, but would not reduce i/o. My proposal
> processes TermDocs twice, but since Lucene processes query terms in
> parallel, and with filesystem caching, no extra i/o will be performed.

Even the double TermDocs processing might be fixed later by a
special-purpose scorer.

Regards,
Paul Elschot