[jira] Commented: (LUCENE-329) Fuzzy query scoring issues

Mark Harwood (JIRA) Thu, 27 Jan 2011 09:03:22 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12987650#action_12987650
 ]


Mark Harwood commented on LUCENE-329:
-------------------------------------

bq.  I think you can safely implement a RewriteMethod to do whatever you want?

Yep, I've got workarounds using FuzzyLikeThis that work for me but have long 
had a general unease about the "out of the box" experience for others.

However, things are certainly better than they were when this issue was first 
raised, and the main concerns have been addressed.

bq. So FuzzyQuery behaves now more as one would expect

Is it worth explicitly stating those expectations? Mine would be based on these 
principles:
1) IDF is commonly accepted as useful when ranking partial matches of queries 
with multiple optional clauses
2) IDF doesn't stop being useful if one of those clauses just happens to be a 
term flagged as "fuzzy".

So given a query:    rareWord~ OR commonWord~ 
I would expect an exact match on "rareWord" to rank higher than an exact match 
on "commonWord".
I don't think the current implementation respects this.
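
To make that expectation concrete, here is a minimal sketch of the arithmetic. 
It assumes the classic DefaultSimilarity-style formula idf = 1 + ln(numDocs / 
(docFreq + 1)); the index size and document frequencies are hypothetical 
numbers picked only for illustration, and this is not the FuzzyQuery code 
path itself.

```python
import math


def classic_idf(doc_freq: int, num_docs: int) -> float:
    """IDF in the classic TF-IDF style: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Hypothetical index statistics (assumptions, not measurements).
NUM_DOCS = 1_000_000
DOC_FREQ = {"rareWord": 10, "commonWord": 100_000}

# Under plain IDF weighting, the rarer term carries the larger weight, so an
# exact match on it should outrank an exact match on the common term.
idf = {term: classic_idf(df, NUM_DOCS) for term, df in DOC_FREQ.items()}

for term, weight in idf.items():
    print(f"{term}: idf={weight:.3f}")
```

With these numbers the weight for "rareWord" comes out well above the one for 
"commonWord", which is the ordering I would expect to survive when both 
clauses are marked fuzzy.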

> Fuzzy query scoring issues
> --------------------------
>
>                 Key: LUCENE-329
>                 URL: https://issues.apache.org/jira/browse/LUCENE-329
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 1.2rc5
>         Environment: Operating System: All
> Platform: All
>            Reporter: Mark Harwood
>            Priority: Minor
>             Fix For: 3.1, 4.0
>
>         Attachments: patch.txt
>
>
> Queries which automatically produce multiple terms (wildcard, range, prefix, 
> fuzzy, etc.) currently suffer from two problems:
> 1) Scores for matching documents are significantly smaller than term queries 
> because of the volume of terms introduced (A match on query Foo~ is 0.1 
> whereas a match on query Foo is 1).
> 2) The rarer forms of expanded terms are favoured over those of more common 
> forms because of the IDF. When using Fuzzy queries, for example, rare 
> misspellings typically appear in results before the more common correct 
> spellings.
> I will attach a patch that corrects the issues identified above by 
> 1) Overriding Similarity.coord to counteract the downplaying of scores 
> introduced by expanding terms.
> 2) Taking the IDF factor of the most common form of expanded terms as the 
> basis of scoring all other expanded terms.
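
A rough sketch of point 2 from the issue description above, under stated 
assumptions: the expanded terms, their document frequencies, and their 
similarity boosts are all hypothetical, and the IDF formula is the classic 
1 + ln(numDocs / (docFreq + 1)). It only illustrates the idea of scoring every 
expanded term with the IDF of the most common form; it is not the attached 
patch.

```python
import math


def classic_idf(doc_freq: int, num_docs: int) -> float:
    """IDF in the classic TF-IDF style: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


NUM_DOCS = 1_000_000

# Hypothetical expansion of a fuzzy query on "lucene":
# term -> (document frequency, similarity boost relative to the query term)
EXPANDED = {
    "lucene": (50_000, 1.00),   # common, correct spelling
    "lucine": (20, 0.83),       # rare misspelling
    "lucenne": (5, 0.86),       # rarer misspelling
}


def per_term_weights(expanded):
    """Each expanded term uses its own IDF, so rare misspellings win."""
    return {t: sim * classic_idf(df, NUM_DOCS) for t, (df, sim) in expanded.items()}


def shared_idf_weights(expanded):
    """Every expanded term uses the IDF of the most common form."""
    max_df = max(df for df, _ in expanded.values())
    shared = classic_idf(max_df, NUM_DOCS)
    return {t: sim * shared for t, (_, sim) in expanded.items()}


def best(weights):
    """Return the term with the highest weight."""
    return max(weights, key=weights.get)


print("per-term IDF favours:", best(per_term_weights(EXPANDED)))
print("shared IDF favours:  ", best(shared_idf_weights(EXPANDED)))
```

With per-term IDF the rarest misspelling gets the largest weight; with the 
shared IDF only the similarity boost separates the variants, so the exact 
spelling ranks first.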

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

