[jira] [Commented] (LUCENE-3846) Fuzzy suggester

Eks Dev (Commented) (JIRA) Sun, 04 Mar 2012 13:27:22 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-3846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13222014#comment-13222014
 ]


Eks Dev commented on LUCENE-3846:
---------------------------------

{quote}
feel free to show me evidence they do
{quote}

Even here they help a lot; do not underestimate the error model (as in the 
noisy-channel approach; see http://norvig.com/spell-correct.html for a nice 
overview).

Examples, off the top of my head:
if you search for Carin in the set {Karin, Marin, Darin} (all valid names, at 
edit distance one), you would prefer to see Karin as the highest-ranked (if not 
the only) fuzzy suggestion, because C and K are close consonants.

Or discount substituting a vowel for a vowel relative to a substitution that 
involves a consonant. Mistaking one vowel for another is more probable than 
mistaking one consonant for another, or a consonant for a vowel (as long as 
humans do the typing).
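
To make the two examples above concrete, here is a minimal sketch of a weighted 
edit distance in Python. Everything in it is an assumption for illustration: 
the consonant groups, the cost values (0.4, 0.5, 1.0) and the function names 
are hypothetical and are not taken from the LUCENE-3846 patch.

```python
VOWELS = set("aeiou")
# Hypothetical groups of easily confused consonants (illustrative only).
CLOSE_CONSONANTS = [set("ck"), set("sz"), set("mn"), set("bp"), set("dt")]


def sub_cost(a, b):
    """Substitution cost with discounts for plausible human mistakes."""
    a, b = a.lower(), b.lower()
    if a == b:
        return 0.0
    if a in VOWELS and b in VOWELS:
        return 0.5  # assumed discount: vowel mistaken for vowel
    if any(a in group and b in group for group in CLOSE_CONSONANTS):
        return 0.4  # assumed discount: close consonants
    return 1.0


def weighted_distance(source, target, indel_cost=1.0):
    """Levenshtein distance (no transpositions) with weighted substitutions."""
    prev = [j * indel_cost for j in range(len(target) + 1)]
    for i, s in enumerate(source, 1):
        curr = [i * indel_cost]
        for j, t in enumerate(target, 1):
            curr.append(min(
                prev[j] + indel_cost,          # delete s
                curr[j - 1] + indel_cost,      # insert t
                prev[j - 1] + sub_cost(s, t),  # substitute s -> t
            ))
        prev = curr
    return prev[-1]


candidates = ["Karin", "Marin", "Darin"]
ranked = sorted(candidates, key=lambda c: weighted_distance("Carin", c))
```

With these assumed costs, Karin sorts ahead of Marin and Darin for the query 
Carin, although all three are at plain edit distance one.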

Books scanned using OCR have no problems with phonetics, but they have others...

Context is important: in-word context as part of the "error model" 
(character-level context, like the previous character), but even more important 
is the context from the "language model", which normally dominates. 
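
A rough sketch of how the two models combine in a noisy-channel ranking: each 
candidate is scored by log P(candidate) + log P(typed | candidate). The counts 
and error probabilities below are made up purely for illustration (they are not 
measured data), and the names are hypothetical.

```python
import math

# Made-up corpus counts standing in for the language model (illustration only).
COUNTS = {"Karin": 50, "Marin": 5000, "Darin": 40}
# Made-up error-model values for P(typed = "Carin" | intended candidate).
ERROR_PROB = {"Karin": 0.05, "Marin": 0.005, "Darin": 0.005}


def score(candidate):
    """Noisy-channel score: log prior plus log channel probability."""
    total = sum(COUNTS.values())
    log_prior = math.log(COUNTS[candidate] / total)
    log_channel = math.log(ERROR_PROB[candidate])
    return log_prior + log_channel


ranked = sorted(COUNTS, key=score, reverse=True)
```

With these invented numbers the error model prefers Karin, but the much larger 
count for Marin outweighs it, which is the sense in which the language model 
can dominate.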

I could look for some interesting papers in my archives if you are not 
convinced yet :)
This one is worth reading (http://acl.ldc.upenn.edu/P/P00/P00-1037.pdf); it 
tackles, among other things, exactly this topic. 

{quote}
it's easy to use a custom cost matrix. The cost can also be context-dependent 
too (based on past matched characters, though not [easily] future ones).
{quote}
 
Great to hear that!  
Prefix-based context is the only sub-word-level context I have ever used. I 
doubt lookahead would add much. 
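
As a sketch of what prefix-based context could look like, the edit distance 
below lets the insertion cost depend on the previous character of the intended 
word, never on later ones. The doubled-letter rule, the cost 0.3 and the 
function names are assumptions for illustration, not how the patch does it.

```python
def insert_cost(prev_char, ch):
    """Cost of inserting ch given the previous character of the intended word.

    Hypothetical rule: typing a doubled letter only once is a common slip,
    so restoring a repeat of the previous character is cheap.
    """
    return 0.3 if prev_char == ch else 1.0


def context_distance(typed, intended):
    """Edit distance whose insertion cost looks one character back."""
    rows, cols = len(typed) + 1, len(intended) + 1
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = d[i - 1][0] + 1.0  # delete every typed character
    for j in range(1, cols):
        prev_char = intended[j - 2] if j >= 2 else ""
        d[0][j] = d[0][j - 1] + insert_cost(prev_char, intended[j - 1])
    for i in range(1, rows):
        for j in range(1, cols):
            prev_char = intended[j - 2] if j >= 2 else ""
            sub = 0.0 if typed[i - 1] == intended[j - 1] else 1.0
            d[i][j] = min(
                d[i - 1][j] + 1.0,                                  # delete
                d[i][j - 1] + insert_cost(prev_char, intended[j - 1]),  # insert
                d[i - 1][j - 1] + sub,                              # match/substitute
            )
    return d[-1][-1]
```

Under these assumptions, "leter" is closer to "letter" than to "lester", even 
though both differ from it by a single inserted character.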

                
> Fuzzy suggester
> ---------------
>
>                 Key: LUCENE-3846
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3846
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 3.6, 4.0
>
>         Attachments: LUCENE-3846.patch
>
>
> Would be nice to have a suggester that can handle some fuzziness (like spell 
> correction) so that it's able to suggest completions that are "near" what you 
> typed.
> As a first go at this, I implemented 1T (ie up to 1 edit, including a 
> transposition), except the first letter must be correct.
> But there is a penalty, ie, the "corrected" suggestion needs to have a much 
> higher freq than the "exact match" suggestion before it can compete.
> Still tons of nocommits, and somehow we should merge this / make it work with 
> analyzing suggester too (LUCENE-3842).
