[jira] Issue Comment Edited: (LUCENE-329) Fuzzy query scoring issues

Mark Harwood (JIRA) Mon, 15 Feb 2010 09:07:53 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12833876#action_12833876
 ]


Mark Harwood edited comment on LUCENE-329 at 2/15/10 5:05 PM:
--------------------------------------------------------------

bq. consider simpler case

OK - but we need to remember that it is important to achieve balance _across_ 
different fuzzy queries as well as terms _within_ the same fuzzy query.
Let's stick to the terms within a single fuzzy query for now:

bq. I guess you would like to score the second term higher, meaning Lower 
frequency

No, the variant's frequency is not a deciding factor - only edit distance is. 
"Johana" has similarity 0.6 while "Joahn" has 0.2, so I would favour result one 
(although this difference seems a little off in this case).
The basic assumption is that the user's input is valid and not a typo (deriving 
spelling suggestions etc. is a different topic and one we shouldn't try to 
cover here). 
Fuzzy matching can drag in all sorts of unqualified variants with massively 
different frequencies. Because we cannot control these discrepancies we should 
reward all these alternatives using the known factors we have to hand - the IDF 
of the user's supposedly valid input and the similarity measure of each variant 
compared to the input.
We could get fancy about the probability of variants given the other input 
terms in the query, but that feels like it's straying into spell checker 
territory and ngrams etc.
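
To make the proposal above concrete, here is a minimal sketch, not the attached patch. It assumes the classic Lucene-style IDF formula (1 + ln(numDocs / (docFreq + 1))) and takes the similarity values 0.6 and 0.2 from the example above as given. The class name, method names, and the document counts are hypothetical and chosen only for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: weight every fuzzy variant by the IDF of the user's
// input term and the variant's similarity, ignoring the variant's own frequency.
class FuzzyVariantWeights {

    // Classic Lucene-style IDF (assumed here): 1 + ln(numDocs / (docFreq + 1)).
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    // Weight of one variant: the input term's IDF scaled by the variant's similarity.
    static double variantWeight(double inputIdf, double similarity) {
        return inputIdf * similarity;
    }

    public static void main(String[] args) {
        int numDocs = 1_000_000;   // hypothetical index size
        int inputDocFreq = 5_000;  // hypothetical docFreq of the user's input term
        double inputIdf = idf(inputDocFreq, numDocs);

        // Similarities taken from the example in the comment above.
        Map<String, Double> similarities = new LinkedHashMap<>();
        similarities.put("Johana", 0.6);
        similarities.put("Joahn", 0.2);

        for (Map.Entry<String, Double> e : similarities.entrySet()) {
            System.out.printf("%s -> %.3f%n",
                    e.getKey(), variantWeight(inputIdf, e.getValue()));
        }
    }
}
```

Because every variant shares the same IDF factor, the ordering of variants within one fuzzy query depends only on similarity, which is the behaviour argued for above.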

> Fuzzy query scoring issues
> --------------------------
>
>                 Key: LUCENE-329
>                 URL: https://issues.apache.org/jira/browse/LUCENE-329
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 1.2rc5
>         Environment: Operating System: All
> Platform: All
>            Reporter: Mark Harwood
>            Priority: Minor
>         Attachments: patch.txt
>
>
> Queries which automatically produce multiple terms (wildcard, range, prefix, 
> fuzzy etc.) currently suffer from two problems:
> 1) Scores for matching documents are significantly smaller than term queries 
> because of the volume of terms introduced (A match on query Foo~ is 0.1 
> whereas a match on query Foo is 1).
> 2) The rarer forms of expanded terms are favoured over those of more common 
> forms because of the IDF. When using fuzzy queries, for example, rare 
> misspellings typically appear in results before the more common correct 
> spellings.
> I will attach a patch that corrects the issues identified above by:
> 1) Overriding Similarity.coord to counteract the downplaying of scores 
> introduced by expanding terms.
> 2) Taking the IDF factor of the most common form of expanded terms as the 
> basis of scoring all other expanded terms.
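
The two fixes described in the issue can be sketched roughly as follows. This is an illustration in plain Java, not the contents of patch.txt. It assumes the default coord factor is overlap / maxOverlap and the classic IDF formula 1 + ln(numDocs / (docFreq + 1)); the class name, method names, terms, and frequencies are hypothetical.

```java
import java.util.Map;

// Hypothetical sketch of the two corrections described in the issue.
class ExpandedTermScoringSketch {

    // Assumed default coord: fraction of query terms that matched.
    // One match out of ten expanded terms gives 0.1.
    static double defaultCoord(int overlap, int maxOverlap) {
        return (double) overlap / maxOverlap;
    }

    // Fix 1: a flat coord, so expanding a query into many terms
    // does not shrink the score of a matching document.
    static double flatCoord(int overlap, int maxOverlap) {
        return 1.0;
    }

    // Assumed classic IDF: 1 + ln(numDocs / (docFreq + 1)).
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    // Fix 2: use the IDF of the most common expanded term (highest docFreq)
    // as the basis for scoring every expanded term.
    static double sharedIdf(Map<String, Integer> docFreqs, int numDocs) {
        int maxDocFreq = 0;
        for (int docFreq : docFreqs.values()) {
            maxDocFreq = Math.max(maxDocFreq, docFreq);
        }
        return idf(maxDocFreq, numDocs);
    }
}
```

With a shared IDF, a rare misspelling no longer receives a larger IDF boost than the common correct spelling, so it stops outranking it on rarity alone.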
