Re: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

David Spencer Mon, 31 Jan 2005 14:41:04 -0800

Doug Cutting wrote:

David Spencer wrote:
I worked w/ Chuck to get up a test page that shows search results with 2 versions of Similarity side by side.
David,
This looks great!  Thanks for doing this.
Is the default operator AND or OR? It appears to be OR, but it should probably be AND. That's become the industry standard since QueryParser was first written. Also, any chance we can get explanations for hits?

It is difficult to decipher what's doing what. I think we should separately evaluate query formulation and boosting from changes to tf/idf.

We ought to first compare searching body only, ignoring titles, then

Well, a step in the direction of analyzing things step by step is that I now show a 2x2 matrix of search results, one cell for each combo of Similarity and query parser:

http://www.searchmorph.com/kat/wikipedia-similarity.jsp?s=chess+champion

Upper left cell is the pure default case. Bottom right cell is the case of 2 new things (new Similarity, new query parser). The 2 other cells each have just 1 "variable" changed; see the row/col labels to decipher.

There's no reason I can't also toss in a row for a 3rd query (say, body only), or a 4th (with phrases..) - this is just a step, which I hope doesn't confuse the issue.

The more general form is that for "n" indexes and "m" query parsers we can show a matrix of n cols by m rows...
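To make the general form concrete, here is a minimal sketch that only enumerates the cell labels of such a grid, with one column per index and one row per query parser. The class name, method name, and label strings are hypothetical, and no searching is done.

```java
// Hypothetical sketch: lays out one cell per (query parser, index) pair.
class ComparisonMatrixSketch {

    // Returns grid[row][col], where rows are query parsers and columns are indexes.
    static String[][] cells(String[] indexes, String[] parsers) {
        String[][] grid = new String[parsers.length][indexes.length];
        for (int row = 0; row < parsers.length; row++) {
            for (int col = 0; col < indexes.length; col++) {
                grid[row][col] = indexes[col] + " / " + parsers[row];
            }
        }
        return grid;
    }

    public static void main(String[] args) {
        // Labels are placeholders for the 2x2 case described above.
        String[][] grid = cells(
            new String[] {"default Similarity", "new Similarity"},
            new String[] {"default query parser", "new query parser"});
        for (String[] row : grid) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

With this layout the upper left cell is the all-default case and the bottom right cell has both changes, matching the page.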

subsequently try different query formulations over multiple fields with a fixed weighting algorithm. Yes, ignoring titles when searching wikipedia might not be the best approach, but the point is not to over-optimize for wikipedia but rather to find algorithms that work well with general text collections. Removing titles makes the problem harder, which should in turn make it easier to see deficiencies.

Simpler yet, we ought to first try body-only with no proximity, just AND, in order to select good tf/idf formulations. Then we should add auto-proximity searching into the mix, and finally add multiple fields. Does this make sense?

MultiFieldQueryParser is known to be deficient. A better general-purpose multi-field query formulator might be like that used by Nutch. It would translate a query "t1 t2" given fields f1 and f2 into something like:
+(f1:t1^b1 f2:t1^b2)
+(f1:t2^b1 f2:t2^b2)
f1:"t1 t2"~s1^b3
f2:"t1 t2"~s2^b4
Where b1 and b2 are boosts for f1 and f2, and b3 and b4 are boosts for phrase matching in f1 and f2, and s1 and s2 are slop for f1 and f2. We'd really only need to vary b1 and b3, and could use 1.0 for b2 and b4 and infinity for s1 and s2.
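As a rough illustration of the expansion described above, the following sketch builds the query text for any number of terms and fields. It only assembles a string and does not call Lucene. The class and method names are hypothetical, the "title" and "body" field names and the boost of 2.0 are example values, Integer.MAX_VALUE stands in for "infinity" slop, and skipping the phrase clauses for a single-term query is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: builds the query *string* only, with no Lucene dependency.
class MultiFieldQuerySketch {

    static String expand(String[] terms, String[] fields,
                         float[] boolBoosts, float[] phraseBoosts, int[] slops) {
        List<String> clauses = new ArrayList<>();

        // One required clause per term: the term must match in at least one field.
        for (String term : terms) {
            List<String> perField = new ArrayList<>();
            for (int i = 0; i < fields.length; i++) {
                perField.add(fields[i] + ":" + term + "^" + boolBoosts[i]);
            }
            clauses.add("+(" + String.join(" ", perField) + ")");
        }

        // One optional sloppy-phrase clause per field, rewarding terms that occur together.
        // Assumption: a single-term query gets no phrase clauses.
        if (terms.length > 1) {
            String phrase = String.join(" ", terms);
            for (int i = 0; i < fields.length; i++) {
                clauses.add(fields[i] + ":\"" + phrase + "\"~" + slops[i] + "^" + phraseBoosts[i]);
            }
        }
        return String.join("\n", clauses);
    }

    public static void main(String[] args) {
        // Example values only: boost the first field, leave the second at 1.0.
        System.out.println(expand(
            new String[] {"chess", "champion"},
            new String[] {"title", "body"},
            new float[] {2.0f, 1.0f},
            new float[] {2.0f, 1.0f},
            new int[] {Integer.MAX_VALUE, Integer.MAX_VALUE}));
    }
}
```

Each term gets its own required group across all fields, so every term must appear somewhere, while the phrase clauses only add to the score.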

Do folks agree that this is a good general formulation? If so, would someone like to contribute a version of MultiFieldQueryParser that implements this? The API should probably be something like:
  static Query parse(String queryString,
                     String[] fields,
                     float[] boolBoosts,
                     float[] phraseBoosts,
                     int[] slops);
A simplified version might be:
  static Query parse(String queryString,
                     String[] fields,
                     float[] boosts);
This could use infinity for slops and assume boolBoosts[i] == phraseBoosts[i].
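One way the simplified form could derive the full argument set is sketched below. The holder class and its factory method are hypothetical and not part of Lucene, and Integer.MAX_VALUE is assumed as the stand-in for "infinity" slop.

```java
import java.util.Arrays;

// Hypothetical holder for the arguments of the full five-argument form; not a Lucene class.
class MultiFieldParseArgs {
    final String[] fields;
    final float[] boolBoosts;
    final float[] phraseBoosts;
    final int[] slops;

    MultiFieldParseArgs(String[] fields, float[] boolBoosts,
                        float[] phraseBoosts, int[] slops) {
        // All per-field arrays must line up with the field list.
        if (boolBoosts.length != fields.length
                || phraseBoosts.length != fields.length
                || slops.length != fields.length) {
            throw new IllegalArgumentException("one boost and one slop are needed per field");
        }
        this.fields = fields;
        this.boolBoosts = boolBoosts;
        this.phraseBoosts = phraseBoosts;
        this.slops = slops;
    }

    // Simplified form: a single boost per field is reused for both the boolean
    // and the phrase clauses, and every slop is set to the largest int.
    static MultiFieldParseArgs fromBoosts(String[] fields, float[] boosts) {
        int[] slops = new int[fields.length];
        Arrays.fill(slops, Integer.MAX_VALUE);
        return new MultiFieldParseArgs(fields, boosts, boosts.clone(), slops);
    }

    public static void main(String[] args) {
        MultiFieldParseArgs a = fromBoosts(
            new String[] {"title", "body"}, new float[] {2.0f, 1.0f});
        System.out.println(Arrays.toString(a.phraseBoosts) + " " + Arrays.toString(a.slops));
    }
}
```

Keeping the defaults in one place means the three-argument parse could simply delegate to the five-argument one.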
Doug
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
