Re: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

Doug Cutting Tue, 01 Feb 2005 09:42:21 -0800

Chuck Williams wrote:

Doug Cutting wrote:
  > What did you think of my DensityPhraseQuery proposal?

It is a step in the direction of what I have in mind, but I'd like to go
further.  How about a query class with these properties:
  1.  Inputs are:
      a.  F = list of fields
      b.  B = list of field boosts (1:1 correspondence with F)
      c.  T = list of terms or phrases, each either optional or required
      d.  P = proximity-sloping window
  2.  Generate matches that contain every required T in some F, and if
      there are no required T's, then at least one optional T in some F.
  3.  Score matches based on these considerations:
      a.  Normal TermQuery and PhraseQuery scores for individual matches
          in individual fields.
      b.  Boost scores for proximity of TermQuery and PhraseQuery
          matches in individual fields, based on some function of P (term
          proximity).
      c.  Boost scores based on number of optional T's matched in at
          least one F (term diversity).

That's a lot of functionality bundled into a single Query class! I'd rather make it possible to assemble this from reusable parts. And it almost can be already. Then we can offer such a thing pre-packaged.

So let me take it point-by-point:

1a-c is the new MultiFieldQueryParser implementation.
1d is Similarity.sloppyFreq()
2 is BooleanQuery (except the weird optional stuff)
3a is TermQuery and PhraseQuery
3b is DensityPhraseQuery (to be implemented)
3c is Similarity.coord()

So I think this can be implemented using the expansion I proposed yesterday for MultiFieldQueryParser, plus something like my DensityPhraseQuery and perhaps a few Similarity tweaks.
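
As a rough illustration of points 1d and 3c (an editorial sketch, not code from this thread), the two Similarity hooks can be modelled in a few lines. The formulas below mirror what Lucene's default Similarity is generally understood to use, 1/(distance+1) for sloppyFreq() and overlap/maxOverlap for coord(); the function `toy_score` and its inputs are hypothetical and only show how the coord factor rewards term diversity.

```python
def sloppy_freq(distance: int) -> float:
    """Proximity weight: closer matches (smaller edit distance) count more."""
    return 1.0 / (distance + 1)


def coord(overlap: int, max_overlap: int) -> float:
    """Fraction of query clauses that matched; 0.0 if there are no clauses."""
    if max_overlap == 0:
        return 0.0
    return overlap / max_overlap


def toy_score(clause_scores, max_clauses):
    """Hypothetical combiner: sum the matching clause scores, then scale
    by coord so documents matching more optional terms rank higher."""
    matched = [s for s in clause_scores if s > 0]
    return sum(matched) * coord(len(matched), max_clauses)


if __name__ == "__main__":
    # An exact phrase match versus one that is three positions off.
    print(sloppy_freq(0), sloppy_freq(3))
    # Two of four optional clauses matched.
    print(toy_score([1.0, 0.0, 2.0, 0.0], 4))
```

Under these assumptions a document matching two of four optional clauses keeps only half of its summed clause score, which is the "term diversity" boost of 3c expressed as a penalty on partial matches.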

  > If field boosting needs to then trump idf, we should be able to deal
  > with that when we subsequently tune field boosting, no?  We can, e.g.,
  > square the field boosts if we need.

Perhaps, but that seems to me to be a hack on top of a hack.  Current
literature seems to consistently not square idf -- I found one reference
that specifically says even Salton removed the squaring after he first
proposed it a long time ago.  The simpler solution is just to remove the
squaring.

I wasn't arguing that we shouldn't alter the idf definition. Precisely the opposite in fact. If squaring idf is bad, then that should show up in single-field search and we can adjust it in that context. You had claimed that good idf formulation is confounded with multi-field search. I do not believe that and that's what I was speaking to. The Salton work you cite is all single-field stuff.
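
To make the squaring question concrete, here is a small single-field sketch (an editorial illustration, not code from this thread). It assumes the idf form commonly attributed to Lucene's default Similarity, log(numDocs/(docFreq+1)) + 1; the `term_weight` helper, its `square_idf` switch, and the document counts are hypothetical.

```python
import math


def idf(doc_freq: int, num_docs: int) -> float:
    """Assumed default-style idf: rarer terms get larger weights."""
    return math.log(num_docs / (doc_freq + 1)) + 1.0


def term_weight(doc_freq, num_docs, boost=1.0, square_idf=True):
    """Hypothetical per-term weight: boost times idf, or boost times idf^2."""
    w = idf(doc_freq, num_docs)
    return boost * (w * w if square_idf else w)


if __name__ == "__main__":
    n = 1000
    rare, common = 9, 99  # made-up document frequencies
    for squared in (False, True):
        ratio = (term_weight(rare, n, square_idf=squared)
                 / term_weight(common, n, square_idf=squared))
        print("squared" if squared else "plain", round(ratio, 3))
```

With these made-up counts the rare term outweighs the common one by the idf ratio when idf is used once, and by the square of that ratio when idf is squared. The effect appears with a single field and no boosts, which is why it can be evaluated in that setting.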

Doug

