RE: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

Chuck Williams Mon, 31 Jan 2005 14:19:25 -0800

Doug Cutting wrote:
  > Here's a three term query:
  > 
  > +(f1:t1^b1 f2:t1^b2)
  > +(f1:t2^b1 f2:t2^b2)
  > +(f1:t3^b1 f2:t3^b2)
  > f1:"t1 t2 t3"~s1^b3
  > f2:"t1 t2 t3"~s2^b4

That expansion is scalable, but it only accounts for proximity of all
query terms together.  E.g., it does not favor a match where t1 and t2
are close together while t3 is distant over a match where all 3 terms
are distant.  Worse, it would not favor a match with t1 and t2 in a
short title, and t2 and t3 proximal in the content (with no occurrence
of t1 in the content) vs. a match with t1 and t2 in the title and t2 and
t3 distant in the content.
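
A minimal sketch of the first point, assuming a simplified proximity measure: the width of the smallest window containing one occurrence of each term. This is a stand-in for slop, not Lucene's sloppy-phrase computation, and the positions and the `min_window` helper are hypothetical. Both documents have the same all-terms window, so a single phrase clause over t1, t2 and t3 cannot tell them apart, while a pairwise measure can.

```python
from itertools import product

def min_window(positions, terms):
    """Smallest span (max position - min position) covering one occurrence of each term."""
    best = None
    for combo in product(*(positions[t] for t in terms)):
        width = max(combo) - min(combo)
        if best is None or width < best:
            best = width
    return best

# Hypothetical term positions within one field of two documents.
doc_a = {"t1": [0], "t2": [1], "t3": [50]}   # t1 and t2 adjacent, t3 distant
doc_b = {"t1": [0], "t2": [25], "t3": [50]}  # all three terms spread out

all_terms = ["t1", "t2", "t3"]
window_a = min_window(doc_a, all_terms)      # window over all three terms
window_b = min_window(doc_b, all_terms)
pair_a = min_window(doc_a, ["t1", "t2"])     # window over the t1/t2 pair only
pair_b = min_window(doc_b, ["t1", "t2"])
```

The all-terms windows are equal for the two documents, while the t1/t2 pair window is much smaller for the first one.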

  > I'm not sure what you mean by scalable.

A more complete expansion might be something like this (although I think
it is still suboptimal):

+(f1:t1^b1 f2:t1^b2)
+(f1:t2^b1 f2:t2^b2)
+(f1:t3^b1 f2:t3^b2)
f1:"t1 t2 t3"~s1^b3
f2:"t1 t2 t3"~s2^b4
f1:"t1 t2"~s3^b5
f1:"t2 t3"~s3^b6
f1:"t1 t3"~s3^b7
f2:"t1 t2"~s3^b8
f2:"t2 t3"~s3^b9
f2:"t1 t3"~s3^b10

b5 thru b10 would be computed based on some function of the individual
field and term boosts.

That is what I meant by unscalable and quadratic in the number of query
terms.
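
To make the growth concrete, here is a rough sketch that generates both expansions as query-syntax strings and counts the clauses. The `expand` function is hypothetical, the boost and slop names are symbolic placeholders only, and the order of the pair clauses may differ from the listing above.

```python
from itertools import combinations

def expand(terms, fields, pairwise=False):
    """Return the clauses of the multi-field expansion as query-syntax strings.

    Boost and slop names (b1, s1, ...) are symbolic placeholders; no values
    are computed for them.
    """
    clauses = []
    # One required clause per term, matching it in any field.
    for t in terms:
        alts = " ".join(f"{f}:{t}^b{i}" for i, f in enumerate(fields, 1))
        clauses.append(f"+({alts})")
    boost = len(fields)
    # One sloppy phrase over all terms per field.
    phrase = " ".join(terms)
    for i, f in enumerate(fields, 1):
        boost += 1
        clauses.append(f'{f}:"{phrase}"~s{i}^b{boost}')
    if pairwise:
        # One sloppy phrase per field and per unordered pair of terms.
        pair_slop = len(fields) + 1
        for f in fields:
            for a, b in combinations(terms, 2):
                boost += 1
                clauses.append(f'{f}:"{a} {b}"~s{pair_slop}^b{boost}')
    return clauses

fields = ["f1", "f2"]
three = ["t1", "t2", "t3"]
ten = [f"t{i}" for i in range(1, 11)]

basic_count = len(expand(three, fields))                   # n + F clauses
pair_count = len(expand(three, fields, pairwise=True))     # n + F + F*n*(n-1)/2
ten_pair_count = len(expand(ten, fields, pairwise=True))
```

With F fields and n terms, the all-terms form has n + F clauses, while the pairwise form adds F * n * (n - 1) / 2 more.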

  > Is that distinct from my goal to develop an improved
  > MultiFieldQueryParser for Lucene 2.0?

Not distinct, but I think the first step is to decide on the expansion
we want.  Unless somebody has a better idea, I think the best solution
is a new Query class that simultaneously supports multiple fields, term
diversity and term proximity.  It would be similar to SpanQuery, but
generalized.  It would be like BooleanQuery in the sense that individual
query clauses could be required or not.  Then, default AND could be
achieved by expanding queries to all-required.

With this new Query class, revised versions of QueryParser and
MultiFieldQueryParser would generate it.

Am I way off-base somewhere and/or is there a simpler approach to the
same end?

  > If we want to change the way idf is used, is there a reason we cannot
  > evaluate that change on its own, then, once that's settled, move on to
  > the next issue?  We may find that some things cannot be changed in
  > isolation, my guess is that idf and "term diversity" can and should be
  > discussed separately.

Sure, idf is important enough to evaluate independently as a factor.
However, I do not think these considerations are orthogonal.  For
example, I'm putting a lot of weight in field boosting and don't want
the preference of title matches over body matches to be overwhelmed by
the idf's.  Similarly for the lengthNorm and tf's.  This is why my
Similarity flattens them all considerably more than DefaultSimilarity
does.

With a single field, I still think it makes sense to flatten these more
(e.g., at least eliminate the squaring of idf and the extreme effects of
lengthNorm for very short documents), but perhaps not as much as I do
for the current fields.
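
A small numeric sketch of the two effects, assuming DefaultSimilarity computes idf as ln(numDocs / (docFreq + 1)) + 1 and lengthNorm as 1 / sqrt(numTerms). The flattened lengthNorm and its `min_terms` floor are hypothetical illustrations, not the formulas of my Similarity, and the collection sizes are made up.

```python
import math

def default_idf(doc_freq, num_docs):
    # Assumed DefaultSimilarity form: ln(numDocs / (docFreq + 1)) + 1
    return math.log(num_docs / (doc_freq + 1)) + 1.0

def default_length_norm(num_terms):
    # Assumed DefaultSimilarity form: 1 / sqrt(numTerms)
    return 1.0 / math.sqrt(num_terms)

def flat_length_norm(num_terms, min_terms=10):
    # Hypothetical flattening: fields shorter than min_terms are treated
    # as if they had min_terms terms, capping the boost for tiny fields.
    return 1.0 / math.sqrt(max(num_terms, min_terms))

num_docs = 1_000_000                      # made-up collection size
rare = default_idf(10, num_docs)          # term found in 10 documents
common = default_idf(100_000, num_docs)   # term found in 100,000 documents

ratio_squared = (rare / common) ** 2      # idf applied twice
ratio_plain = rare / common               # idf applied once
```

Squaring the idf widens the gap between a rare and a common term, which is the kind of factor that can swamp a modest field boost; the floor on lengthNorm only changes very short fields.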

With multiple fields, and not desiring to require all query terms in
every match, it is necessary to handle term diversity.
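
One possible reading of term diversity, sketched with hypothetical weights and helper functions (this is not the formula in my Similarity): scale a plain sum of field-weighted matches by the fraction of distinct query terms matched in any field.

```python
def plain_score(matches, weights):
    """Sum the weight of every (field, term) match, ignoring which terms matched."""
    return sum(weights[field] for field, term in matches)

def diverse_score(matches, weights, query_terms):
    """Scale the plain score by the fraction of distinct query terms matched in any field."""
    matched = {term for field, term in matches if term in query_terms}
    return plain_score(matches, weights) * len(matched) / len(query_terms)

# Hypothetical field weights and matches for the query "t1 t2".
weights = {"title": 2.0, "content": 1.0}
query_terms = {"t1", "t2"}
doc_x = [("title", "t1"), ("content", "t1")]   # t1 in title and in content
doc_y = [("title", "t1"), ("content", "t2")]   # t1 in title, t2 in content

plain_x, plain_y = plain_score(doc_x, weights), plain_score(doc_y, weights)
diverse_x = diverse_score(doc_x, weights, query_terms)
diverse_y = diverse_score(doc_y, weights, query_terms)
```

The plain sums tie, while the diversity-scaled score prefers the document that matches both terms.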

So, the combination of application choices (number and semantics of
fields, default OR vs. default AND, etc.) definitely affects the
optimal tuning of the Similarity.

This all notwithstanding, we can do a single field test, but I may want
to change my Similarity for it.

  > Requiring all query terms is acceptable and even expected by most
  > searchers today.  All of the major web search engines implement this,
  > and that's where folks learn to search today.

Perhaps, but a lot of enterprise search engine applications do not.
What is best really depends on the app.

Are you sure this is true of the major web search engines?  I've heard
it stated quite a lot, but it doesn't conform to my experience with
Google.  At least the excerpts
frequently do not include all query terms.  I just tried this bizarre
query:
  hilbert space frank zappa george bush john kerry

There are two hits and they do not appear to have all terms (even in the
cached version of the second hit -- got an error accessing the cache on
the first hit).  This included looking at the page source.

Chuck

  > -----Original Message-----
  > From: Doug Cutting [mailto:[EMAIL PROTECTED]
  > Sent: Monday, January 31, 2005 1:44 PM
  > To: Lucene Developers List
  > Subject: Re: URL to compare 2 Similarity's ready-- Re: Scoring benchmark
  > evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher
  > problems with Similarity.docFreq() ?
  > 
  > Chuck Williams wrote:
  > > I think the differences are pretty clear as the system stands.  Notice
  > > a substantial difference in the idf's in the respective explanations.  I
  > > continue to think the current mechanism weights these too high,
  > > primarily due to its squaring.
  > >
  > > The other big difference occurs when all query terms are not required,
  > > as the current mechanism then does not consider term diversity (e.g., t1
  > > in title and in content gets as good a score as t1 in title and t2 in
  > > content), while the new approach does.
  > 
  > Right.  I'd like to be able to separately discuss such issues and how to
  > fix them.  Confounding them makes changes to Lucene an all-or-nothing
  > proposition.  What will be easiest procedurally is to make a series of
  > uncontroversial, clear improvements to the code, not wholesale
  > replacements.  In the end we may get to the same place, but we'll still
  > have more people on board.  I don't think a revolution is required, just
  > some evolution.
  > 
  > If we want to change the way idf is used, is there a reason we cannot
  > evaluate that change on its own, then, once that's settled, move on to
  > the next issue?  We may find that some things cannot be changed in
  > isolation, my guess is that idf and "term diversity" can and should be
  > discussed separately.
  > 
  > >   > It would translate a query "t1 t2" given fields f1 and f2 into
  > >   > something like:
  > >   >
  > >   > +(f1:t1^b1 f2:t1^b2)
  > >   > +(f1:t2^b1 f2:t2^b2)
  > >   > f1:"t1 t2"~s1^b3
  > >   > f2:"t1 t2"~s2^b4
  > >
  > > This does not seem scalable.  How do you expand a general query with n
  > > terms?
  > 
  > Perhaps my example was unclear.  Here's a three term query:
  > 
  > +(f1:t1^b1 f2:t1^b2)
  > +(f1:t2^b1 f2:t2^b2)
  > +(f1:t3^b1 f2:t3^b2)
  > f1:"t1 t2 t3"~s1^b3
  > f2:"t1 t2 t3"~s2^b4
  > 
  > Is that any clearer?
  > 
  > > I sent a note earlier today suggesting that a new Query class is needed
  > > that simultaneously handles multiple fields, term diversity and term
  > > proximity.
  > 
  > Is that distinct from my goal to develop an improved
  > MultiFieldQueryParser for Lucene 2.0?
  > 
  > >   > Do folks agree that this is a good general formulation?
  > >
  > > Not unless it is scalable and the desire is to require all query terms.
  > 
  > I'm not sure what you mean by scalable.
  > 
  > > I would rather not require all query terms, which introduces a more
  > > complex diversity requirement (ensure that as many distinct query terms
  > > as possible are matched somewhere).
  > 
  > Requiring all query terms is acceptable and even expected by most
  > searchers today.  All of the major web search engines implement this,
  > and that's where folks learn to search today.
  > 
  > Doug
  > 
  > ---------------------------------------------------------------------
  > To unsubscribe, e-mail: [EMAIL PROTECTED]
  > For additional commands, e-mail: [EMAIL PROTECTED]
