Re: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

Paul Elschot Tue, 01 Feb 2005 14:20:04 -0800

Doug, 

On Tuesday 01 February 2005 20:05, Doug Cutting wrote:
> Chuck Williams wrote:
> >   > So I think this can be implemented using the expansion I proposed
> >   > yesterday for MultiFieldQueryParser, plus something like my
> >   > DensityPhraseQuery and perhaps a few Similarity tweaks.
> > 
> > I don't think that works unless the mechanism is limited to default-AND
> > (i.e., all clauses required).
> 
> Right.  I have repeatedly argued for default-AND.
> 
> > However, I don't see a way to integrate term proximity into that
> > expansion.  Specifically, I don't see a way to handle proximity and
> > coverage simultaneously without managing the multiple fields, field
> > boosts and proximity considerations in a single query class.  Whence,
> > the proposal for such a class.
> 
> To repeat my three-term, two-field example:
> 
> +(f1:t1^b1 f2:t1^b2)
> +(f1:t2^b1 f2:t2^b2)
> +(f1:t3^b1 f2:t3^b2)
> f1:"t1 t2 t3"~s1^b3
> f2:"t1 t2 t3"~s2^b4


Glad to see some more structure in queries.

> 
> Coverage is handled by the first three clauses.  Each term must match in 
> at least one field.  Proximity is boosted by the last two clauses: when 
> terms occur close together, the score is increased.  The implementation 
> of the ~ operator could be improved, as I proposed.
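
For what it's worth, the expansion above generalizes mechanically to any number of terms and fields. Here is a minimal sketch that only builds the query string; the class and method names are made up for illustration and are not part of Lucene or of any proposed patch. Boosts and slops are passed as plain strings so that the symbolic example can be reproduced as is.

```java
import java.util.List;

// Hypothetical helper, not Lucene API: builds the expanded query string only.
public class MultiFieldExpansion {

    /**
     * termBoosts.get(i), phraseBoosts.get(i) and slops.get(i) all belong to
     * fields.get(i); the four per-field lists must have the same length.
     */
    public static String expand(List<String> terms, List<String> fields,
                                List<String> termBoosts, List<String> phraseBoosts,
                                List<String> slops) {
        StringBuilder sb = new StringBuilder();

        // Coverage: one required clause per term, matching in any field.
        for (String term : terms) {
            sb.append("+(");
            for (int i = 0; i < fields.size(); i++) {
                if (i > 0) {
                    sb.append(' ');
                }
                sb.append(fields.get(i)).append(':').append(term)
                  .append('^').append(termBoosts.get(i));
            }
            sb.append(")\n");
        }

        // Proximity: one optional sloppy phrase per field over all terms.
        String phrase = String.join(" ", terms);
        for (int i = 0; i < fields.size(); i++) {
            sb.append(fields.get(i)).append(":\"").append(phrase).append("\"~")
              .append(slops.get(i)).append('^').append(phraseBoosts.get(i))
              .append('\n');
        }
        return sb.toString();
    }
}
```

Called with terms (t1, t2, t3), fields (f1, f2), term boosts (b1, b2), phrase boosts (b3, b4) and slops (s1, s2), this gives back the five clauses quoted above, one per line. The size grows as terms x fields for the coverage part plus one phrase clause per field.
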
> 
> > Do you see a way to do that?  I.e., do you see a scalable expansion that
> > addresses all the issues for both default-or and default-and?
> 
> I am not really very interested in default-OR.  I think there are good 
> reasons that folks have gravitated towards default-AND.  I would prefer 
> we focus on a good default-AND solution for now.
> 
> If one wishes to rank things by coordination first, and then by score, 
> as an improved default-OR, then one needs more than just score-based 
> ranking.  Trying to concoct scores that alone guarantee such a ranking 
> is very fragile.  In general, one would need a HitCollector API that 
> takes both the coord and the score.  This is possible, but I'm not in a 
> hurry to implement it.

An alternative is to make sure all scores are bounded.
Then the coordination factor can be folded into the score within that
same bound, while still preserving the coordination order.
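
To make that concrete, here is a small sketch under the assumption that every raw score lies in [0, 1). The class and method names are hypothetical, and this is not how Lucene currently computes coord; it only shows that a bounded score leaves room for the number of matching clauses to dominate the ordering.

```java
// Hypothetical sketch, not Lucene API.
public class BoundedCoordScore {

    /**
     * Combines the number of matching clauses with a bounded score so that
     * documents order by coordination first and by score second.
     * Requires 0 <= score < 1 and 0 <= matchedClauses <= totalClauses.
     * The result is again in [0, 1).
     */
    public static double combine(int matchedClauses, int totalClauses, double score) {
        if (score < 0.0 || score >= 1.0) {
            throw new IllegalArgumentException("score must be in [0, 1)");
        }
        if (matchedClauses < 0 || matchedClauses > totalClauses) {
            throw new IllegalArgumentException("matchedClauses out of range");
        }
        // matchedClauses + score lies in [matchedClauses, matchedClauses + 1),
        // so more matching clauses always wins, whatever the scores are.
        return (matchedClauses + score) / (totalClauses + 1);
    }
}
```

With this, a document matching 2 of 3 clauses with a very low score still ranks above one matching 1 of 3 clauses with a score close to 1, and documents with equal coordination fall back to the score.
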

> 
> Lucene's development is constrained.  We want to improve Lucene, to 
> make search results better, to make it faster, and add needed features, 
> but we must at the same time keep it back-compatible, maintainable and 
> easy-to-use.  The smaller the code, the easier it is to maintain and 
> understand, so, e.g., a change that adds a lot of new code is harder to 
> accept than one that just tweaks existing code a bit.  We are changing 
> many APIs for Lucene 2.0, but we're also providing a clear migration 
> path for Lucene 1.X users.  When we add a new, improved API we must 
> deprecate the API it replaces and make sure that the new API supports 
> all the features of the old API.  We cannot afford to maintain multiple 
> implementations of similar functionality.  So, for these reasons, I am 
> not comfortable simply committing your DistributingMultiFieldQueryParser 
> and MaxDisjunctionQuery.  We need to fit these into Lucene, figure out 
> what they replace, etc.  Otherwise Lucene could just become a 
> hodge-podge of poorly maintained classes.  If we think these or 
> something like them do a better job, then we'd like it to be natural for 
> folks upgrading to start using them in favor of old methods, so that, 
> long term, we don't have to maintain both.  So the problem is not simply 
> figuring out what a better default ranking algorithm is, it is also 
> figuring out how to successfully integrate such an algorithm into Lucene.

MaxDisjunctionQuery could be made to fit with the new DisjunctionSumScorer.
It's a bit of work, but straightforward.
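
Roughly, and assuming (as the name suggests; the details are not spelled out in this thread) that MaxDisjunctionQuery scores a document by the maximum over its matching subclauses where the sum scorer adds them up, the two differ only in how the subscores are combined. The class below is a hypothetical illustration, not code from either implementation.

```java
// Hypothetical illustration, not Lucene code: the two combiners differ only
// in how the scores of the matching subclauses are merged for one document.
public class DisjunctionCombiners {

    /** Sum of the subclause scores, as a summing disjunction would do. */
    public static float sum(float[] subScores) {
        float total = 0.0f;
        for (float s : subScores) {
            total += s;
        }
        return total;
    }

    /** Maximum of the subclause scores; 0 when nothing matched. */
    public static float max(float[] subScores) {
        float best = 0.0f;
        for (float s : subScores) {
            best = Math.max(best, s);
        }
        return best;
    }
}
```

The iteration over the matching subscorers stays the same in both cases, which is why I expect the fit to be mostly a matter of swapping the combining step.
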

For the DistributingMultiFieldQueryParser you already gave a dissection, iirc.

> 
> > I think
> > the query class I've proposed does that, and should be no more complex
> > than the current SpanQuery mechanism, for example.
> 
> The SpanQuery mechanism is quite complex and permits matching of a 
> completely different sort: fragments rather than whole documents.  What 
> you're proposing does not seem so radically different that it cannot be 
> part of the normal document-matching mechanism.
> 
> > Also, it should be
> > more efficient than a nested construction of more primitive components
> > since it can be directly optimized.
> 
> It might use a bit less CPU, but would not reduce i/o.  My proposal 
> processes TermDocs twice, but since Lucene processes query terms in 
> parallel, and with filesystem caching, no extra i/o will be performed.

Even the double TermDocs processing might be fixed later by a
special-purpose scorer.
 
Regards,
Paul Elschot

