Thanks for the suggestions, Paul.

I've just tried a scheme that takes the maximum docFreq of the expanded terms and uses it as the docFreq shared by all expanded terms in their idf calculations, giving a lower, shared idf. I'm still removing the coordination factor on the BooleanQuery that groups the term queries.
Results seem much more sensible than the existing way of handling fuzzy queries. Here are some example results:


Query: smith~
==============
New scheme top result: Smith Smith
New scheme top score: 1.0
Existing scheme top result: Smita Khurana
Existing scheme top score: 0.02


Query: pete~ smith~
==============
New scheme top result: Peter Smith
New scheme top score: 0.99
Existing scheme top result: Morrissey Pete
Existing scheme top score: 0.07

Query: David Harland~
==============
New scheme top result: David Harland
New scheme top score: 0.68
Existing scheme top result: David Burland
Existing scheme top score: 0.18


I've currently amended FuzzyQuery to create new subclasses of BooleanQuery and TermQuery that override the similarity methods coord (for BooleanQuery) and idf (for TermQuery). The other multi-term queries will need to take the same approach.
Does this sound like the best way to do this?
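
To make the arithmetic concrete, here is a rough, Lucene-free sketch of the two changes. It is not the actual FuzzyQuery patch: the class and method names are made up for illustration, the docFreq values and index size are invented, and the idf formula assumes the DefaultSimilarity form 1 + ln(numDocs / (docFreq + 1)).

```java
import java.util.Arrays;

// Illustrative only: none of these names exist in Lucene.
final class SharedIdfSketch {

    // Assumed to mirror DefaultSimilarity.idf: 1 + ln(numDocs / (docFreq + 1)).
    public static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log(numDocs / (double) (docFreq + 1));
    }

    // New scheme: every expanded term scores with the idf of the most
    // common expansion (the maximum docFreq), so rare variants get no boost.
    public static double sharedIdf(int[] expandedDocFreqs, int numDocs) {
        int maxDocFreq = Arrays.stream(expandedDocFreqs).max().orElse(0);
        return idf(maxDocFreq, numDocs);
    }

    // Default coord: fraction of the query's clauses that a document matched.
    public static double defaultCoord(int overlap, int maxOverlap) {
        return overlap / (double) maxOverlap;
    }

    // New scheme: coord is neutralised for the BooleanQuery of expanded terms,
    // since a document is not expected to match several spelling variants.
    public static double flatCoord(int overlap, int maxOverlap) {
        return 1.0;
    }

    public static void main(String[] args) {
        int numDocs = 100000;            // invented index size
        int[] docFreqs = {5000, 3, 1};   // invented docFreqs for three expansions
        for (int df : docFreqs) {
            System.out.printf("docFreq=%d perTermIdf=%.3f sharedIdf=%.3f%n",
                    df, idf(df, numDocs), sharedIdf(docFreqs, numDocs));
        }
    }
}
```

With per-term idf, the rarest expansion gets the largest weight, which is presumably why a rare near-miss can outrank the exact spelling under the existing scheme. With the shared value, all expansions are weighted alike and the remaining ranking differences come from the other scoring factors.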


Cheers
Mark


