[ https://issues.apache.org/jira/browse/LUCENE-3846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13221961#comment-13221961 ]
Eks Dev commented on LUCENE-3846:
---------------------------------

Awesome! FST/A has come a long way.

Just a few random thoughts, triggered by "... "corrected" suggestion needs to have a much higher freq than the "exact match"...".

The influence of frequency is normally slightly more complicated than "only more popular", depending on the search task the user is facing. "Only more popular" helps if we assume the user typed it wrong and our suggestion dictionary is always right. But when the user types the word correctly and the collection contains errors, you would cut out all the documents that only a fuzzy match could reach.

What I have found to work pretty well is treating this as a nearest-neighbor problem: the task is to find the closest matches to the query, some of which are more popular and some less. Take for example a case where the user types "black dog" and our collection contains the document "blaKC dog". Since the frequency of "blakc" is much lower than that of "black", "only more popular" would miss this document.

What works pretty well out of the box is comparing the frequency of the query word and of the "candidate" against some reasonable cut-off and classifying each as an "HF" or "LF" (high/low frequency) term. This is based on the fact that typos are normally very rare (if not, they should be treated as synonyms!). So if the user types an LF token, the fuzzy candidate would probably be HF, and the other way around. But as said, it depends on what the task is.

The next level for "fuzzy *" in Lucene is specifying separate costs for inserts/deletes, swaps (substitutions) and transpositions at the character (byte) level, optionally taking the position of the edit into account. This brings precision++ if used properly, as in:

- "inserting/deleting a silent h should cost less than other letters (thomas vs tomas)"
- "phonetics: swap "c" <-> "k" is less evil than default"
- "inserting s at the end... bug vs bugs"

Apart from that, I see absolutely nothing more anyone on earth could do better :)

Sorry again for just shooting "wish lists" at you guys; my schedule really does not permit any serious work in the form of patches.

> Fuzzy suggester
> ---------------
>
>                 Key: LUCENE-3846
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3846
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 3.6, 4.0
>
>         Attachments: LUCENE-3846.patch
>
>
> Would be nice to have a suggester that can handle some fuzziness (like spell
> correction) so that it's able to suggest completions that are "near" what you
> typed.
> As a first go at this, I implemented 1T (ie up to 1 edit, including a
> transposition), except the first letter must be correct.
> But there is a penalty, ie, the "corrected" suggestion needs to have a much
> higher freq than the "exact match" suggestion before it can compete.
> Still tons of nocommits, and somehow we should merge this / make it work with
> analyzing suggester too (LUCENE-3842).
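
To make the two ideas from the comment concrete (the HF/LF check and separate, position-aware edit costs), here is a rough Python sketch. It is not Lucene code and not taken from the patch: the function names, the cut-off of 100 and the 0.5/1.0 costs are all made up for illustration, and the distance is a plain dynamic-programming variant (optimal string alignment) rather than anything automaton-based.

```python
# Illustrative sketch only; all names and numbers below are assumptions.

HF_CUTOFF = 100  # hypothetical cut-off between low- and high-frequency terms


def is_hf(freq, cutoff=HF_CUTOFF):
    """Classify a term as high-frequency (True) or low-frequency (False)."""
    return freq >= cutoff


def plausible_pair(query_freq, cand_freq, cutoff=HF_CUTOFF):
    """Typos are rare, so a query/candidate pair looks like 'typo vs correct
    form' when exactly one of the two is HF. The direction does not matter."""
    return is_hf(query_freq, cutoff) != is_hf(cand_freq, cutoff)


def indel_cost(ch, at_end):
    """Assumed costs: a silent 'h' anywhere, or an 's' at the end, is cheap."""
    if ch == "h" or (ch == "s" and at_end):
        return 0.5
    return 1.0


def swap_cost(a, b):
    """Assumed costs: the phonetic swap c <-> k is cheaper than the default."""
    return 0.5 if {a, b} == {"c", "k"} else 1.0


def weighted_distance(a, b, transpose_cost=1.0):
    """Edit distance with per-character, position-aware costs."""
    n, m = len(a), len(b)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):  # delete every character of a
        d[i][0] = d[i - 1][0] + indel_cost(a[i - 1], i == n)
    for j in range(1, m + 1):  # insert every character of b
        d[0][j] = d[0][j - 1] + indel_cost(b[j - 1], j == m)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            ca, cb = a[i - 1], b[j - 1]
            best = min(
                d[i - 1][j] + indel_cost(ca, i == n),      # delete ca
                d[i][j - 1] + indel_cost(cb, j == m),      # insert cb
                d[i - 1][j - 1] + (0.0 if ca == cb else swap_cost(ca, cb)),
            )
            # transposition of two adjacent characters
            if i > 1 and j > 1 and ca == b[j - 2] and a[i - 2] == cb:
                best = min(best, d[i - 2][j - 2] + transpose_cost)
            d[i][j] = best
    return d[n][m]
```

With these assumed costs, "thomas"/"tomas", "bug"/"bugs" and "cat"/"kat" all come out at 0.5, while an ordinary substitution such as "dog"/"dig" stays at 1.0. `plausible_pair` accepts the "black" (HF) query against the "blakc" (LF) candidate, which a pure "only more popular" rule would drop. Where the cut-off sits, and whether both-LF pairs should really be rejected, depends on the task.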