Doug Cutting wrote:
> That's a lot of functionality bundled into a single Query class! I'd
> rather make it possible to assemble this from reusable parts. And it
> almost can be already. Then we can offer such a thing pre-packaged.

That would be great, if it could be done.

> So let me take it point-by-point:
>
> 1a-c is the new MultiFieldQueryParser implementation.
> 1d is Similarity.sloppyFreq()
> 2 is BooleanQuery (except the weird optional stuff)

BooleanQuery does support the "weird optional stuff"; these are just BooleanClauses that are neither required nor prohibited. I don't consider that "weird".

> 3a is TermQuery and PhraseQuery
> 3b is DensityPhraseQuery (to be implemented)
> 3c is Similarity.coord()
>
> So I think this can be implemented using the expansion I proposed
> yesterday for MultiFieldQueryParser, plus something like my
> DensityPhraseQuery and perhaps a few Similarity tweaks.

I don't think that works unless the mechanism is limited to default-AND (i.e., all clauses required). As soon as you support default-OR, what I've been calling the term diversity problem arises. It might better be called the term coverage problem: ensuring that matching more of the query's terms in some field scores better than repeatedly matching the same term in different fields.

I address the term coverage problem, without consideration of proximity, by using DistributingMultiFieldQueryParser and MaxDisjunctionQuery. These work well, as Dave's example site shows. However, I don't see a way to integrate term proximity into that expansion. Specifically, I don't see a way to handle proximity and coverage simultaneously without managing the multiple fields, field boosts and proximity considerations in a single query class. Hence the proposal for such a class.

Do you see a way to do that? I.e., do you see a scalable expansion that addresses all the issues for both default-OR and default-AND? I think the query class I've proposed does that, and it should be no more complex than the current SpanQuery mechanism, for example. It should also be more efficient than a nested construction of more primitive components, since it can be optimized directly.
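To make the term coverage problem concrete, here is a minimal sketch in plain Java. It is not Lucene code: the class name, the method names and the flat per-field scores of 1.0 are all assumptions made for illustration. It compares summing every (term, field) match against keeping only the best field per term, which is roughly the idea behind a max-over-fields disjunction, under a default-OR query.

```java
import java.util.List;
import java.util.Map;

// Illustrative sketch only; none of these names are Lucene APIs.
// A document is modeled as field -> (term -> per-field match score).
class TermCoverageSketch {

    /** Adds up every (term, field) match, so one term repeated across fields keeps gaining score. */
    static double sumOverFields(List<String> queryTerms, Map<String, Map<String, Double>> doc) {
        double total = 0.0;
        for (String term : queryTerms) {
            for (Map<String, Double> termScores : doc.values()) {
                total += termScores.getOrDefault(term, 0.0);
            }
        }
        return total;
    }

    /** Keeps only the best field per term, so the score grows with the number of distinct terms matched. */
    static double maxOverFields(List<String> queryTerms, Map<String, Map<String, Double>> doc) {
        double total = 0.0;
        for (String term : queryTerms) {
            double best = 0.0;
            for (Map<String, Double> termScores : doc.values()) {
                best = Math.max(best, termScores.getOrDefault(term, 0.0));
            }
            total += best;
        }
        return total;
    }

    /** Hypothetical document that matches only "lucene", but in three fields. */
    static Map<String, Map<String, Double>> repeatedTermDoc() {
        return Map.of(
            "title", Map.of("lucene", 1.0),
            "summary", Map.of("lucene", 1.0),
            "body", Map.of("lucene", 1.0));
    }

    /** Hypothetical document that matches both query terms, each in one field. */
    static Map<String, Map<String, Double>> coveringDoc() {
        return Map.of(
            "title", Map.of("lucene", 1.0),
            "body", Map.of("scoring", 1.0));
    }

    public static void main(String[] args) {
        List<String> query = List.of("lucene", "scoring");
        // Prints: sum: repeated=3.0 covering=2.0
        System.out.println("sum: repeated=" + sumOverFields(query, repeatedTermDoc())
            + " covering=" + sumOverFields(query, coveringDoc()));
        // Prints: max: repeated=1.0 covering=2.0
        System.out.println("max: repeated=" + maxOverFields(query, repeatedTermDoc())
            + " covering=" + maxOverFields(query, coveringDoc()));
    }
}
```

With summation, the document that repeats one term in three fields outranks the document that covers both terms; with the per-term maximum the order flips. Proximity is deliberately left out of the sketch, which is exactly the part that is hard to add on top of this kind of expansion.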
I think this could make a substantial improvement to Lucene's relevance ranking.

> I wasn't arguing that we shouldn't alter the idf definition. Precisely
> the opposite in fact. If squaring idf is bad, then that should show up
> in single-field search and we can adjust it in that context. You had
> claimed that good idf formulation is confounded with multi-field search.
> I do not believe that and that's what I was speaking to. The Salton
> work you cite is all single-field stuff.

I didn't object to a single-field test. I think my message started by agreeing to that. What I said is that optimal idf tuning is a function of the fields and query expansions being used. In general, I believe in tuning relevance ranking per application. In my experience, this makes a huge difference. E.g., Google's relevance ranking works well on the web, but is known to produce poor results in typically link-poor enterprise document repositories (there have been many published comments about this, and I've competed with them directly and demonstrated it to potential customers).

Chuck