RE: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

Chuck Williams Tue, 01 Feb 2005 10:05:32 -0800

Doug Cutting wrote:
  > That's a lot of functionality bundled into a single Query class!  I'd
  > rather make it possible to assemble this from reusable parts.  And it
  > almost can be already.  Then we can offer such a thing pre-packaged.


That would be great, if it could be done.

  > So let me take it point-by-point:
  > 
  > 1a-c is the new MultiFieldQueryParser implementation.
  > 1d is Similarity.sloppyFreq()
  > 2 is BooleanQuery (except the weird optional stuff)

BooleanQuery does support the "weird optional stuff"; these are just
BooleanClauses that are neither required nor prohibited.  I don't
consider that "weird".
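
A minimal sketch of the clause semantics described above, in plain Python
rather than Lucene's own API.  The names Clause and matches are
hypothetical, and the rule that a query with no required clauses needs at
least one optional clause to match is my understanding of BooleanQuery,
not something stated in this thread.

```python
from dataclasses import dataclass


@dataclass
class Clause:
    """Hypothetical stand-in for a boolean clause (not Lucene's class)."""
    term: str
    required: bool = False
    prohibited: bool = False


def matches(clauses, doc_terms):
    """Return True if a document containing doc_terms satisfies the clauses."""
    doc_terms = set(doc_terms)
    required = [c for c in clauses if c.required]
    optional = [c for c in clauses if not c.required and not c.prohibited]

    # Any prohibited term present rejects the document outright.
    if any(c.term in doc_terms for c in clauses if c.prohibited):
        return False
    # Every required term must be present.
    if not all(c.term in doc_terms for c in required):
        return False
    # With required clauses, optional ones only influence the score.
    if required:
        return True
    # With no required clauses, at least one optional clause must match.
    return any(c.term in doc_terms for c in optional)
```

For example, with clauses [Clause("lucene", required=True),
Clause("scoring"), Clause("spam", prohibited=True)], a document
containing only "lucene" matches, while one containing "lucene" and
"spam" does not.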

  > 3a is TermQuery and PhraseQuery
  > 3b is DensityPhraseQuery (to be implemented)
  > 3c is Similarity.coord()
  > 
  > So I think this can be implemented using the expansion I proposed
  > yesterday for MultiFieldQueryParser, plus something like my
  > DensityPhraseQuery and perhaps a few Similarity tweaks.

I don't think that works unless the mechanism is limited to default-AND
(i.e., all clauses required).  As soon as you support default-OR, then
what I've been calling the term diversity problem arises (which might
better be called the term coverage problem; i.e., ensure that matching
more terms in the query in some field is better than repeatedly matching
the same term in different fields).

I address the term coverage problem, without consideration of proximity,
by using DistributingMultiFieldQueryParser and MaxDisjunctionQuery.
These work well, as Dave's example site shows.
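
To illustrate the term coverage problem with toy numbers: the sketch
below compares summing every (term, field) match against taking, for
each query term, only its best-scoring field.  The second rule is the
idea the name MaxDisjunctionQuery suggests; the real class may differ in
detail.  The per-field scores, the field names and both function names
are assumptions for illustration, and coord() and field boosts are left
out.

```python
def sum_of_fields(doc_scores):
    """Sum every (term, field) match; repeats across fields all count."""
    return sum(score
               for per_field in doc_scores.values()
               for score in per_field.values())


def max_per_term(doc_scores):
    """For each query term keep only its best field, then sum over terms."""
    return sum(max(per_field.values())
               for per_field in doc_scores.values()
               if per_field)


# doc_a matches one query term in three fields.
doc_a = {"lucene": {"title": 1.0, "body": 1.0, "summary": 1.0}}
# doc_b matches two different query terms, each in one field.
doc_b = {"lucene": {"body": 1.0}, "scoring": {"body": 1.0}}

for name, doc in (("doc_a", doc_a), ("doc_b", doc_b)):
    print(name, sum_of_fields(doc), max_per_term(doc))
```

Under the plain sum, doc_a (one term, three fields) outranks doc_b (two
terms); under the max-per-term rule the order is reversed, which is the
behaviour default-OR needs.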

However, I don't see a way to integrate term proximity into that
expansion.  Specifically, I don't see a way to handle proximity and
coverage simultaneously without managing the multiple fields, field
boosts and proximity considerations in a single query class.  Hence
the proposal for such a class.

Do you see a way to do that?  I.e., do you see a scalable expansion that
addresses all the issues for both default-OR and default-AND?  I think
the query class I've proposed does that, and should be no more complex
than the current SpanQuery mechanism, for example.  Also, it should be
more efficient than a nested construction of more primitive components
since it can be directly optimized.  I think this could make a
substantial improvement to Lucene's relevance ranking.

  > I wasn't arguing that we shouldn't alter the idf definition.  Precisely
  > the opposite in fact.  If squaring idf is bad, then that should show up
  > in single-field search and we can adjust it in that context.  You had
  > claimed that good idf formulation is confounded with multi-field search.
  > I do not believe that and that's what I was speaking to.  The Salton
  > work you cite is all single-field stuff.

I didn't object to a single-field test.  I think my message started by
agreeing to that.  What I said is that optimal idf-tuning is a
function of the fields and query expansions being used.  In general, I
believe in tuning relevance ranking per application.  In my experience,
this makes a huge difference.  E.g., Google's relevance ranking works
well on the web, but is known to produce poor results in typically
link-poor enterprise document repositories (there have been many
published comments about this, and I've competed with them directly and
demonstrated it to potential customers).
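
To make the "squaring idf" question concrete, here is a small sketch of
how squaring changes the relative weight of a rare term against a common
one.  It assumes the default formula log(numDocs / (docFreq + 1)) + 1;
the collection size and document frequencies are made-up numbers, not
measurements, and weight_ratio is a hypothetical helper.

```python
import math


def idf(doc_freq, num_docs):
    """Assumed default idf: log(numDocs / (docFreq + 1)) + 1."""
    return math.log(num_docs / (doc_freq + 1)) + 1.0


def weight_ratio(rare_df, common_df, num_docs, power=1):
    """Ratio of rare-term to common-term weight when idf is raised to power."""
    return (idf(rare_df, num_docs) / idf(common_df, num_docs)) ** power


num_docs = 1_000_000          # made-up collection size
rare_df, common_df = 9, 99_999  # made-up document frequencies

print("idf ratio:  ", weight_ratio(rare_df, common_df, num_docs))
print("idf^2 ratio:", weight_ratio(rare_df, common_df, num_docs, power=2))
```

Whenever the rare term's idf exceeds the common term's, squaring widens
the gap between them.  Whether that is helpful is exactly what a
per-application benchmark would have to show.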

Chuck

