Paul Elschot wrote:
  > An alternative is to make sure all scores are bounded.
  > Then the coordination factor can be implemented in the same bound
  > while preserving the coordination order.

If I understand this, I think more is required.  My normalization
proposal from a couple months ago involved a boost-weighted
term-coverage normalization of the raw scores (i.e., based on coord's
that are boost-weighted).  Raw scores would be bounded in [0.0, 1.0],
unlike now, where they are unbounded.  But one also needs a way to
recover, from just the score, critical quality information such as
whether or not all terms were matched.  I was hoping to do this by
simple thresholding, e.g., achieving a property like "results with all
terms matched are always in [0.8, 1.0], and results missing a term
always have a score less than 0.8".  I'm not certain whether that
property can be obtained, but I feel confident it would yield a
pretty good absolute quality measure in any event.
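
To make the thresholding concrete, here is a rough sketch of the kind
of mapping I have in mind (the names and the 0.8 split are just for
illustration; it assumes the raw score has already been normalized
into [0.0, 1.0] and that the boost-weighted coverage fraction is
available):

    // Illustrative sketch only -- not existing Lucene API.  Assumes
    // rawScore is normalized into [0.0, 1.0] and coverage is the
    // boost-weighted fraction of query terms matched, also in
    // [0.0, 1.0].
    public class ThresholdedScorer {

      // Floor of the band reserved for results matching all terms.
      private static final float FULL_COVERAGE_FLOOR = 0.8f;

      public static float score(float rawScore, float coverage) {
        if (coverage >= 1.0f) {
          // All terms matched: compress into [0.8, 1.0].
          return FULL_COVERAGE_FLOOR
              + rawScore * (1.0f - FULL_COVERAGE_FLOOR);
        }
        // At least one term missed: result stays strictly below 0.8.
        return rawScore * coverage * FULL_COVERAGE_FLOOR;
      }
    }

Making such a mapping fall out of the Similarity computation itself,
rather than a post-hoc rescale like this, is the part I'm unsure
about.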

Chuck

  > -----Original Message-----
  > From: Paul Elschot [mailto:[EMAIL PROTECTED]]
  > Sent: Tuesday, February 01, 2005 2:20 PM
  > To: lucene-dev@jakarta.apache.org
  > Subject: Re: URL to compare 2 Similarity's ready-- Re: Scoring
  > benchmark evaluation. Was RE: How to proceed with Bug 31841 -
  > MultiSearcher problems with Similarity.docFreq() ?
  > 
  > Doug,
  > 
  > On Tuesday 01 February 2005 20:05, Doug Cutting wrote:
  > > Chuck Williams wrote:
  > > >   > So I think this can be implemented using the expansion I
  > > >   > proposed yesterday for MultiFieldQueryParser, plus something
  > > >   > like my DensityPhraseQuery and perhaps a few Similarity
  > > >   > tweaks.
  > > >
  > > > I don't think that works unless the mechanism is limited to
  > > > default-AND (i.e., all clauses required).
  > >
  > > Right.  I have repeatedly argued for default-AND.
  > >
  > > > However, I don't see a way to integrate term proximity into
  > > > that expansion.  Specifically, I don't see a way to handle
  > > > proximity and coverage simultaneously without managing the
  > > > multiple fields, field boosts and proximity considerations in
  > > > a single query class.  Whence the proposal for such a class.
  > >
  > > To repeat my three-term, two-field example:
  > >
  > > +(f1:t1^b1 f2:t1^b2)
  > > +(f1:t2^b1 f2:t2^b2)
  > > +(f1:t3^b1 f2:t3^b2)
  > > f1:"t1 t2 t3"~s1^b3
  > > f2:"t1 t2 t3"~s2^b4
  > 
  > Glad to see some more structure in queries.
  > 
  > >
  > > Coverage is handled by the first three clauses.  Each term must
  > > match in at least one field.  Proximity is boosted by the last
  > > two clauses: when terms occur close together, the score is
  > > increased.  The implementation of the ~ operator could be
  > > improved, as I proposed.
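
To make this expansion concrete, it can be assembled with the existing
query classes roughly as follows (just a sketch against the current
BooleanQuery/PhraseQuery API; the boosts b1-b4 and slops s1, s2 are
the placeholders from the example above):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.PhraseQuery;
    import org.apache.lucene.search.TermQuery;

    public class ExpansionSketch {
      public static BooleanQuery expand(String[] terms,
          float b1, float b2, float b3, float b4, int s1, int s2) {
        BooleanQuery query = new BooleanQuery();

        // Coverage: each term must match in at least one field.
        for (int i = 0; i < terms.length; i++) {
          BooleanQuery perTerm = new BooleanQuery();
          TermQuery t1 = new TermQuery(new Term("f1", terms[i]));
          t1.setBoost(b1);
          TermQuery t2 = new TermQuery(new Term("f2", terms[i]));
          t2.setBoost(b2);
          perTerm.add(t1, false, false);     // optional
          perTerm.add(t2, false, false);     // optional
          query.add(perTerm, true, false);   // required
        }

        // Proximity: optional sloppy phrases reward close-together
        // terms with a score boost.
        PhraseQuery p1 = new PhraseQuery();
        PhraseQuery p2 = new PhraseQuery();
        for (int i = 0; i < terms.length; i++) {
          p1.add(new Term("f1", terms[i]));
          p2.add(new Term("f2", terms[i]));
        }
        p1.setSlop(s1);  p1.setBoost(b3);
        p2.setSlop(s2);  p2.setBoost(b4);
        query.add(p1, false, false);         // optional
        query.add(p2, false, false);         // optional

        return query;
      }
    }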
  > >
  > > > Do you see a way to do that?  I.e., do you see a scalable
  > > > expansion that addresses all the issues for both default-OR
  > > > and default-AND?
  > >
  > > I am not really very interested in default-OR.  I think there
  > > are good reasons that folks have gravitated towards default-AND.
  > > I would prefer we focus on a good default-AND solution for now.
  > >
  > > If one wishes to rank things by coordination first, and then by
  > > score, as an improved default-OR, then one needs more than just
  > > score-based ranking.  Trying to concoct scores that alone
  > > guarantee such a ranking is very fragile.  In general, one would
  > > need a HitCollector API that takes both the coord and the score.
  > > This is possible, but I'm not in a hurry to implement it.
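
For what it's worth, such a HitCollector variant might look like the
sketch below; this is purely hypothetical, nothing like it exists in
Lucene today:

    // Hypothetical: a collector that sees the coordination count, so
    // it can rank by coord first and score second without the scores
    // themselves having to encode the coordination order.
    public abstract class CoordHitCollector {
      // Called once for each matching document; coord is the number
      // of query clauses the document matched.
      public abstract void collect(int doc, int coord, float score);
    }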
  > 
  > An alternative is to make sure all scores are bounded.
  > Then the coordination factor can be implemented in the same bound
  > while preserving the coordination order.
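
If I read that right, one banding that would do it looks something
like this (illustrative only; it assumes each per-clause score lies in
[0.0, 1.0)):

    // Documents matching m of n clauses always score in
    // [m/(n+1), (m+1)/(n+1)), so matching more clauses always wins,
    // yet the final score stays bounded in [0.0, 1.0).
    public static float coordinatedScore(int matchingClauses,
        int totalClauses, float clauseScoreSum) {
      // Average per-clause score, also in [0.0, 1.0).
      float avg = (matchingClauses == 0)
          ? 0.0f : clauseScoreSum / matchingClauses;
      // Band by coordination level; order by avg within the band.
      return (matchingClauses + avg) / (totalClauses + 1.0f);
    }

The extra piece I'm after above is the absolute threshold property on
top of this ordering.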
  > 
  > >
  > > Lucene's development is constrained.  We want to improve Lucene,
  > > to make search results better, to make it faster, and add needed
  > > features, but we must at the same time keep it back-compatible,
  > > maintainable and easy-to-use.  The smaller the code, the easier
  > > it is to maintain and understand, so, e.g., a change that adds a
  > > lot of new code is harder to accept than one that just tweaks
  > > existing code a bit.  We are changing many APIs for Lucene 2.0,
  > > but we're also providing a clear migration path for Lucene 1.X
  > > users.  When we add a new, improved API we must deprecate the
  > > API it replaces and make sure that the new API supports all the
  > > features of the old API.  We cannot afford to maintain multiple
  > > implementations of similar functionality.  So, for these
  > > reasons, I am not comfortable simply committing your
  > > DistributingMultiFieldQueryParser and MaxDisjunctionQuery.  We
  > > need to fit these into Lucene, figure out what they replace,
  > > etc.  Otherwise Lucene could just become a hodge-podge of poorly
  > > maintained classes.  If we think these or something like them do
  > > a better job, then we'd like it to be natural for folks
  > > upgrading to start using them in favor of old methods, so that,
  > > long term, we don't have to maintain both.  So the problem is
  > > not simply figuring out what a better default ranking algorithm
  > > is, it is also figuring out how to successfully integrate such
  > > an algorithm into Lucene.
  > 
  > MaxDisjunctionQuery could be made to fit with the new
  > DisjunctionSumScorer.
  > It's a bit of work, but straightforward.
  > 
  > For the DistributingMultiFieldQueryParser you already gave a
  > dissection, IIRC.
  > 
  > >
  > > > I think the query class I've proposed does that, and should
  > > > be no more complex than the current SpanQuery mechanism, for
  > > > example.
  > >
  > > The SpanQuery mechanism is quite complex and permits matching
  > > of a completely different sort: fragments rather than whole
  > > documents.  What you're proposing does not seem so radically
  > > different that it cannot be part of the normal document-matching
  > > mechanism.
  > >
  > > > Also, it should be more efficient than a nested construction
  > > > of more primitive components since it can be directly
  > > > optimized.
  > >
  > > It might use a bit less CPU, but would not reduce I/O.  My
  > > proposal processes TermDocs twice, but since Lucene processes
  > > query terms in parallel, and with filesystem caching, no extra
  > > I/O will be performed.
  > 
  > Even the double TermDocs processing might be fixed later by a
  > special-purpose scorer.
  > 
  > Regards,
  > Paul Elschot
  > 
  > 

