I worked with Chuck to put up a test page that shows search results with 2 versions of Similarity side by side.
David,
This looks great! Thanks for doing this.
Is the default operator AND or OR? It appears to be OR, but it should probably be AND. That's become the industry standard since QueryParser was first written. Also, any chance we can get explanations for hits?
It is difficult to decipher what's doing what. I think we should evaluate query formulation and boosting separately from changes to tf/idf.
We ought to first compare searching body only, ignoring titles, then subsequently try different query formulations over multiple fields with a fixed weighting algorithm. Yes, ignoring titles when searching Wikipedia might not be the best approach, but the point is not to over-optimize for Wikipedia but rather to find algorithms that work well with general text collections. Removing titles makes the problem harder, which should in turn make it easier to see deficiencies.
Simpler yet, we ought to first try body-only with no proximity, just AND, in order to select good tf/idf formulations. Then we should add auto-proximity searching into the mix, and finally add multiple fields. Does this make sense?
MultiFieldQueryParser is known to be deficient. A better general-purpose multi-field query formulator might be like that used by Nutch. It would translate a query "t1 t2" given fields f1 and f2 into something like:
+(f1:t1^b1 f2:t1^b2) +(f1:t2^b1 f2:t2^b2) f1:"t1 t2"~s1^b3 f2:"t1 t2"~s2^b4
Where b1 and b2 are boosts for f1 and f2, and b3 and b4 are boosts for phrase matching in f1 and f2, and s1 and s2 are slop for f1 and f2. We'd really only need to vary b1 and b3, and could use 1.0 for b2 and b4 and infinity for s1 and s2.
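To make the formulation concrete, here is a rough sketch that builds the query string above from whitespace-separated terms. It is only an illustration under some assumptions: the class and method names are made up, it emits query syntax as a String rather than building Lucene Query objects, it does no analysis or escaping of terms, and it expects all arrays to have the same length as the fields array.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene or Nutch.
final class MultiFieldQuerySketch {

    /** Returns a query string in the "+(f1:t1^b1 f2:t1^b2) ... f1:"t1 t2"~s1^b3 ..." shape. */
    static String formulate(String queryString, String[] fields,
                            float[] boolBoosts, float[] phraseBoosts, int[] slops) {
        String trimmed = queryString.trim();
        if (trimmed.isEmpty()) {
            return "";
        }
        // Assumption: terms are split on whitespace only, with no analyzer involved.
        String[] terms = trimmed.split("\\s+");
        List<String> clauses = new ArrayList<>();

        // One required clause per term: the term must match in at least one field.
        for (String term : terms) {
            List<String> perField = new ArrayList<>();
            for (int i = 0; i < fields.length; i++) {
                perField.add(fields[i] + ":" + term + "^" + boolBoosts[i]);
            }
            clauses.add("+(" + String.join(" ", perField) + ")");
        }

        // One optional sloppy-phrase clause per field, which rewards proximity.
        // A phrase clause only makes sense when there are at least two terms.
        if (terms.length > 1) {
            String phrase = String.join(" ", terms);
            for (int i = 0; i < fields.length; i++) {
                clauses.add(fields[i] + ":\"" + phrase + "\"~" + slops[i]
                        + "^" + phraseBoosts[i]);
            }
        }
        return String.join(" ", clauses);
    }
}
```

Calling formulate("t1 t2", ...) with fields f1 and f2 yields two required per-term clauses followed by one phrase clause per field, matching the pattern above.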
Do folks agree that this is a good general formulation? If so, would someone like to contribute a version of MultiFieldQueryParser that implements this? The API should probably be something like:
static Query parse(String queryString, String[] fields, float[] boolBoosts, float[] phraseBoosts, int[] slops);
A simplified version might be:
static Query parse(String queryString, String[] fields, float[] boosts);
This could use infinity for slops and assume boolBoosts[i] == phraseBoosts[i].
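As a sketch of how the simplified signature might expand into the arguments of the full one: the holder class below is hypothetical, the field names in the usage note are placeholders, and using Integer.MAX_VALUE to stand in for an "infinite" slop is my assumption, not something settled here.

```java
import java.util.Arrays;

// Hypothetical argument holder, only to illustrate the expansion.
final class SimplifiedParseArgs {
    final float[] boolBoosts;
    final float[] phraseBoosts;
    final int[] slops;

    private SimplifiedParseArgs(float[] boolBoosts, float[] phraseBoosts, int[] slops) {
        this.boolBoosts = boolBoosts;
        this.phraseBoosts = phraseBoosts;
        this.slops = slops;
    }

    /** Expands one boost per field into the three arrays the full parse() would take. */
    static SimplifiedParseArgs fromBoosts(String[] fields, float[] boosts) {
        if (fields.length != boosts.length) {
            throw new IllegalArgumentException("fields and boosts must have the same length");
        }
        // Assumption: Integer.MAX_VALUE approximates an unbounded slop.
        int[] slops = new int[fields.length];
        Arrays.fill(slops, Integer.MAX_VALUE);
        // boolBoosts[i] == phraseBoosts[i]; copies keep the caller's array untouched.
        return new SimplifiedParseArgs(boosts.clone(), boosts.clone(), slops);
    }
}
```

For example, fromBoosts(new String[] {"title", "body"}, new float[] {2.0f, 1.0f}) gives both boost arrays the values {2.0, 1.0} and sets every slop to Integer.MAX_VALUE.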
Doug