[ 
https://issues.apache.org/jira/browse/LUCENE-3846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13221961#comment-13221961
 ] 

Eks Dev commented on LUCENE-3846:
---------------------------------

Awesome! FST/A has come a long way.

Just a few random thoughts, triggered by "... "corrected" suggestion needs to 
have a much higher freq than the "exact match"...":

The influence of frequency is normally somewhat more complicated than "only 
more popular"; it depends on the search task the user is facing. "Only more 
popular" helps if we assume the user types it wrong and our suggestion 
dictionary is always right. But when the user types the term correctly and the 
collection contains errors, you would cut off all the documents that only 
"fuzzy" can reach.

What I found works pretty well is treating this as a nearest-neighbor problem. 
Namely, the task is to find the closest matches to the query; some are more 
popular and some less. Take for example a case where the user types "black dog" 
and our collection contains the document "blaKC dog": since the frequency of 
"blakc" is much lower than that of "black", "only more popular" would miss this 
document.
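
To make the "black dog" example concrete, here is a minimal Python sketch of 
the two policies. The candidate list, the frequencies and both function names 
are made up for illustration; nothing here is Lucene API.

```python
# Hypothetical fuzzy candidates for the query term "black":
# (term, edit_distance, doc_freq). All numbers are invented.
QUERY_FREQ = 1000
CANDIDATES = [
    ("black", 0, 1000),
    ("blakc", 1, 1),     # the typo that lives in the collection
    ("block", 1, 4000),
    ("slack", 1, 300),
]

def only_more_popular(candidates, query_freq):
    """Keep the exact match plus corrections more frequent than the query term."""
    return [term for term, dist, freq in candidates
            if dist == 0 or freq > query_freq]

def nearest_neighbours(candidates, max_edits=1):
    """Keep everything within max_edits, ranked by distance, then by frequency."""
    hits = [c for c in candidates if c[1] <= max_edits]
    hits.sort(key=lambda c: (c[1], -c[2]))
    return [term for term, dist, freq in hits]
```

With these numbers the first policy never returns "blakc", while the second 
keeps it (ranked last among the one-edit neighbours), so the "blaKC dog" 
document stays reachable.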

What works pretty well out of the box is comparing the frequencies of the query 
word and the "candidate" against some reasonable cut-off and classifying them 
as "HF"/"LF" (high/low frequency) terms. It is based on the fact that typos are 
normally very rare (if not, they should be treated as synonyms!). So if the 
user types an LF token, the fuzzy candidate is probably HF, and the other way 
around.
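
One possible reading of the HF/LF rule, as a small Python sketch. The cut-off 
value and the function names are assumptions chosen for illustration; a real 
cut-off would have to be tuned per collection.

```python
def frequency_class(freq, cutoff):
    """Classify a term as high-frequency ("HF") or low-frequency ("LF")."""
    return "HF" if freq >= cutoff else "LF"

def plausible_correction(query_freq, candidate_freq, cutoff=50):
    """Accept a fuzzy candidate only if it falls in the opposite class.

    Assumption: a typo is rare, so a typo/correct pair should straddle the
    cut-off, in either direction. Two HF terms are more likely to be two
    distinct real words than a typo and its correction.
    """
    return (frequency_class(query_freq, cutoff)
            != frequency_class(candidate_freq, cutoff))
```

For example, plausible_correction(1, 1000) and plausible_correction(1000, 1) 
both accept the pair, covering the "user is wrong" and the "collection is 
wrong" cases, while plausible_correction(1000, 4000) rejects it.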

But as said, it depends on what the task is.


The next level for "fuzzy *" in Lucene is specifying separate costs for 
inserts/deletes, substitutions and transpositions at the character (byte) 
level, optionally taking the position of the edit into account. This brings 
precision++ if used properly, as in:
- "inserting/deleting a silent h should cost less than other letters (thomas vs 
tomas)"
- "phonetics: substituting "c" <-> "k" is less evil than the default"
- "inserting s at the end... bug vs bugs"
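
The three cases above can be sketched with a weighted edit distance. This is 
a plain dynamic-programming version in Python (optimal string alignment, i.e. 
Levenshtein plus adjacent transpositions), not the automaton-based approach 
Lucene uses; the cost functions and the 0.5 discounts are arbitrary 
assumptions.

```python
def weighted_distance(a, b, indel_cost, sub_cost, transpose_cost=1.0):
    """Edit distance with caller-supplied costs.

    indel_cost(ch, pos, length): cost of inserting or deleting ch, where pos
        is its index in the word that contains it and length is that word's
        length.
    sub_cost(x, y): cost of replacing x by y.
    """
    la, lb = len(a), len(b)
    d = [[0.0] * (lb + 1) for _ in range(la + 1)]
    for i in range(1, la + 1):
        d[i][0] = d[i - 1][0] + indel_cost(a[i - 1], i - 1, la)
    for j in range(1, lb + 1):
        d[0][j] = d[0][j - 1] + indel_cost(b[j - 1], j - 1, lb)
    for i in range(1, la + 1):
        for j in range(1, lb + 1):
            same = a[i - 1] == b[j - 1]
            d[i][j] = min(
                d[i - 1][j] + indel_cost(a[i - 1], i - 1, la),   # delete
                d[i][j - 1] + indel_cost(b[j - 1], j - 1, lb),   # insert
                d[i - 1][j - 1] + (0.0 if same else sub_cost(a[i - 1], b[j - 1])),
            )
            # Adjacent transposition, e.g. "kc" <-> "ck".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + transpose_cost)
    return d[la][lb]

def example_indel(ch, pos, length):
    # Crude simplification: every "h" is treated as potentially silent.
    if ch == "h":
        return 0.5
    # A trailing "s" (plural) is cheap to add or drop.
    if ch == "s" and pos == length - 1:
        return 0.5
    return 1.0

def example_sub(x, y):
    # Phonetically close pair "c" <-> "k" costs less than the default.
    return 0.5 if {x, y} == {"c", "k"} else 1.0
```

With these example costs, "thomas"/"tomas", "bug"/"bugs" and "carl"/"karl" 
each come out at 0.5, while an ordinary edit such as "dog"/"dig" still costs 
1.0.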

Apart from that, I see absolutely nothing more that anyone on earth could do 
better :)


Sorry again for just shooting "wish lists" at you guys; my schedule really does 
not permit any serious work in the form of patches.

> Fuzzy suggester
> ---------------
>
>                 Key: LUCENE-3846
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3846
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 3.6, 4.0
>
>         Attachments: LUCENE-3846.patch
>
>
> Would be nice to have a suggester that can handle some fuzziness (like spell 
> correction) so that it's able to suggest completions that are "near" what you 
> typed.
> As a first go at this, I implemented 1T (ie up to 1 edit, including a 
> transposition), except the first letter must be correct.
> But there is a penalty, ie, the "corrected" suggestion needs to have a much 
> higher freq than the "exact match" suggestion before it can compete.
> Still tons of nocommits, and somehow we should merge this / make it work with 
> analyzing suggester too (LUCENE-3842).
