[ 
https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12833860#action_12833860
 ] 

Eks Dev commented on LUCENE-329:
--------------------------------

{quote}
query for John~ Patitucci~ I'm probably more interested in a partial match on 
the rarer surname than a partial match on the common forename. 
{quote}


As a matter of fact, there is not just one term frequency to consider here, 
but two!

Consider a simpler case:
Query term: "Johan" // would be a high-frequency (HF) term
gives:
Fuzzy expanded term 1: "Johana" // high frequency (HF)
Fuzzy expanded term 2: "Joahn" // low frequency (LF)

I guess you would like to score the second term higher, i.e. the one with the 
lower frequency (higher IDF). So far so good. 

Now turn it upside down and search for the LF typo "Joahn". In that case you 
would prefer the HF term "Johan" from the expanded list to score higher.

The point is that the picture is not complete unless both frequencies are 
taken into account (query term and expanded term). In my experience, some 
simple nonlinear hints based on these two frequencies bring easy precision 
gains (HF-LF pairs are much more likely to be typos than HF-HF pairs).
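
To make the idea concrete, here is a minimal sketch of one possible hint of 
this kind. It is not taken from the attached patch or from Lucene itself: the 
class name FuzzyPairHint, the method pairBoost, the constant ALPHA and the 
document frequencies used in main are all hypothetical, chosen only for 
illustration. The boost grows with the size of the frequency gap between the 
query term and the expanded term, in either direction, so HF-LF and LF-HF 
pairs are favoured over HF-HF (and LF-LF) pairs.

```java
public final class FuzzyPairHint {

    // Hypothetical tuning constant: maximum extra boost for a likely typo pair.
    private static final double ALPHA = 0.5;

    private FuzzyPairHint() {}

    /**
     * Illustrative nonlinear hint based on both document frequencies.
     * A large gap between the two frequencies (HF-LF or LF-HF) suggests that
     * one term is a typo of the other, so the pair gets a boost close to
     * 1 + ALPHA. Terms with similar frequencies get a boost close to 1.
     */
    static double pairBoost(int queryDocFreq, int expandedDocFreq) {
        // +1 smoothing avoids division by zero and log(0) for unseen terms.
        double logRatio = Math.log((queryDocFreq + 1.0) / (expandedDocFreq + 1.0));
        // tanh saturates, so extreme gaps do not dominate the score.
        return 1.0 + ALPHA * Math.tanh(Math.abs(logRatio));
    }

    public static void main(String[] args) {
        // Made-up document frequencies, for illustration only.
        int johan = 10000;  // HF
        int johana = 8000;  // HF
        int joahn = 3;      // LF

        System.out.println("Johan -> Johana: " + pairBoost(johan, johana));
        System.out.println("Johan -> Joahn:  " + pairBoost(johan, joahn));
        System.out.println("Joahn -> Johan:  " + pairBoost(joahn, johan));
    }
}
```

Such a factor would be multiplied into whatever per-term score the fuzzy 
expansion already produces. Because it depends only on the size of the gap, 
the same function covers both the "Johan" query and the upside-down "Joahn" 
query.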


> Fuzzy query scoring issues
> --------------------------
>
>                 Key: LUCENE-329
>                 URL: https://issues.apache.org/jira/browse/LUCENE-329
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 1.2rc5
>         Environment: Operating System: All
> Platform: All
>            Reporter: Mark Harwood
>            Priority: Minor
>         Attachments: patch.txt
>
>
> Queries which automatically produce multiple terms (wildcard, range, prefix, 
> fuzzy, etc.) currently suffer from two problems:
> 1) Scores for matching documents are significantly smaller than term queries 
> because of the volume of terms introduced (A match on query Foo~ is 0.1 
> whereas a match on query Foo is 1).
> 2) The rarer forms of expanded terms are favoured over those of more common 
> forms because of the IDF. When using Fuzzy queries for example, rare 
> misspellings typically appear in results before the more common correct 
> spellings.
> I will attach a patch that corrects the issues identified above by 
> 1) Overriding Similarity.coord to counteract the downplaying of scores 
> introduced by expanding terms.
> 2) Taking the IDF factor of the most common form of expanded terms as the 
> basis of scoring all other expanded terms.


