RE: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

Chuck Williams Mon, 31 Jan 2005 13:22:01 -0800

Doug Cutting wrote:
  > Is the default operator AND or OR?  It appears to be OR, but it should
  > probably be AND.  That's become the industry standard since QueryParser
  > was first written.  Also, any chance we can get explanations for hits?


Explanations are available.  Click the score link on a result.

  > It is difficult to decipher what's doing what.  I think we should
  > separately evaluate query formulation and boosting from changes to
  > tf/idf.

Earlier I proposed the opposite, since my mechanism is designed to work in
concert, i.e., the Similarity and the query parsing work together.
Most real collections have at least title and body fields.  We decided
to look at the combined structure and compare results, then dig into
individual details as appropriate to understand the results.

The analysis can be approached bottom-up, a factor at a time, or top
down, looking at two complete formulations and then dissecting them to
further understand their differences.

I think the differences are pretty clear as the system stands.  Notice
a substantial difference in the idf's in the respective explanations.  I
continue to think the current mechanism weights these too high,
primarily due to its squaring.
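
To make the squaring concrete, here is a rough sketch. It assumes the
DefaultSimilarity formula idf = ln(numDocs / (docFreq + 1)) + 1; the
collection size and document frequencies are made-up numbers for
illustration, not figures from the benchmark.

```python
import math

def idf(doc_freq, num_docs):
    # Assumed DefaultSimilarity formula: ln(numDocs / (docFreq + 1)) + 1
    return math.log(num_docs / (doc_freq + 1)) + 1.0

num_docs = 1_000_000                 # hypothetical collection size
rare = idf(10, num_docs)             # hypothetical rare term
common = idf(100_000, num_docs)      # hypothetical common term

# How strongly the rare term outweighs the common one,
# with idf applied once versus applied twice (squared).
ratio_linear = rare / common
ratio_squared = (rare * rare) / (common * common)

print(f"idf   rare/common: {ratio_linear:.2f}")
print(f"idf^2 rare/common: {ratio_squared:.2f}")
```

Whatever the exact numbers, squaring turns the rare-to-common ratio into
its own square, so any gap between terms is widened.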

The other big difference occurs when not all query terms are required,
as the current mechanism then does not consider term diversity (e.g., t1
in title and in content gets as good a score as t1 in title and t2 in
content), while the new approach does.
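
A toy illustration of the diversity point, not either actual scoring
mechanism: the function names and the two documents below are
hypothetical. Counting field matches cannot tell the two documents
apart, while counting distinct matched terms can.

```python
def match_count(doc, query_terms):
    """Count (field, term) matches; repeats of one term count each time."""
    return sum(1 for tokens in doc.values()
                 for t in query_terms if t in tokens)

def matched_terms(doc, query_terms):
    """Return the distinct query terms found in any field of the document."""
    found = set()
    for tokens in doc.values():
        found.update(t for t in query_terms if t in tokens)
    return found

query = ["t1", "t2"]
doc_a = {"title": ["t1"], "content": ["t1"]}   # t1 in both fields
doc_b = {"title": ["t1"], "content": ["t2"]}   # t1 and t2, one field each

for name, doc in (("a", doc_a), ("b", doc_b)):
    print(name, match_count(doc, query), len(matched_terms(doc, query)))
```

Both documents have two field matches, but only the second covers both
query terms.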

  > MultiFieldQueryParser is known to be deficient.  A better
  > general-purpose multi-field query formulator might be like that used by
  > Nutch. It would translate a query "t1 t2" given fields f1 and f2 into
  > something like:
  > 
  > +(f1:t1^b1 f2:t1^b2)
  > +(f1:t2^b1 f2:t2^b2)
  > f1:"t1 t2"~s1^b3
  > f2:"t1 t2"~s2^b4

This does not seem scalable.  How do you expand a general query with n
terms?  I believe Dave has some code that generates all the pairwise
combinations, but this is quadratic in the length of the query and it
doesn't consider proximity of larger collections of query terms.
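
To put numbers on the growth, here is a sketch that builds the clause
strings for n terms. It reads the quoted expansion as one required
clause per term across all fields, plus one sloppy phrase per field.
The pairwise generator is only my guess at what such code might
produce; the function names and the b/s placeholders are illustrative.

```python
from itertools import combinations

def required_clauses(terms, fields):
    # One required disjunction per term, e.g. +(f1:t1^b1 f2:t1^b2)
    return ["+(" + " ".join(f"{f}:{t}^b{i}"
                            for i, f in enumerate(fields, 1)) + ")"
            for t in terms]

def phrase_clauses(terms, fields):
    # One sloppy phrase over the whole query per field, e.g. f1:"t1 t2"~s1^b3
    phrase = " ".join(terms)
    n = len(fields)
    return [f'{f}:"{phrase}"~s{i}^b{n + i}' for i, f in enumerate(fields, 1)]

def pairwise_phrase_clauses(terms, fields):
    # Hypothetical alternative: one sloppy phrase per term pair per field.
    return [f'{f}:"{a} {b}"~s{i}'
            for a, b in combinations(terms, 2)
            for i, f in enumerate(fields, 1)]

fields = ["f1", "f2"]
for n in (2, 5, 10):
    terms = [f"t{k}" for k in range(1, n + 1)]
    print(n, len(required_clauses(terms, fields)),
          len(pairwise_phrase_clauses(terms, fields)))
```

The required clauses grow linearly with the number of terms, while the
pairwise phrases grow as n(n-1)/2 per field, and neither captures
proximity among three or more terms at once.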

I sent a note earlier today suggesting that a new Query class is needed
that simultaneously handles multiple fields, term diversity and term
proximity.

  > Do folks agree that this is a good general formulation?

Not unless it is scalable and the desire is to require all query terms.
I would rather not require all query terms, which introduces a more
complex diversity requirement (ensure that as many distinct query terms
as possible are matched somewhere).

I'm interested in solving this problem and would be happy to contribute
whatever I write.

Chuck

