[ 
https://issues.apache.org/jira/browse/LUCENE-3888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-3888:
--------------------------------

    Attachment: LUCENE-3888.patch

I updated the patch and fixed Koji's test; it's passing, BUT there is a nocommit:
{code}
// nocommit: we need to fix SuggestWord to separate surface and analyzed forms.
// currently the 're-rank' is based on the surface forms!
spellChecker.setAccuracy(0F);
{code}

To explain how the patch currently works with the Japanese case: the 
spellchecker has two phases:
* Phase 1: n-gram approximation phase. Here we generate an n-gram boolean query 
on the Readings. This is working fine.
* Phase 2: re-rank phase. Here we take the candidates from Phase 1 and do a 
real comparison (e.g. Levenshtein) to give them the final score. The problem is 
that this currently uses the surface form!

I think phase 2 should re-rank based on the 'analyzed form' too. Inside the 
spellchecker itself, I don't think this is very difficult: when analyzed != 
surface, we just store the analyzed form for later retrieval.

The problem is that the spellcheck comparison APIs such as SuggestWord don't 
even have any getters or setters, and present no way for me to migrate to 
surface+analyzed in any backwards-compatible way...

I'll think about this in the meantime. Maybe we should just break and clean up 
these APIs, since it's a contrib module and they are funky? 
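For discussion, one possible shape for a cleaned-up SuggestWord that carries 
both forms (purely a sketch of the direction; the field and accessor names are 
hypothetical, not a committed API):

```java
// Sketch only: a SuggestWord-like bean that separates the surface form
// (what we suggest to the user) from the analyzed form (what we compare).
// All names here are hypothetical, not the existing contrib API.
public class SuggestWordSketch {
    private String surface;   // e.g. the Japanese surface form shown to users
    private String analyzed;  // e.g. the Reading used for the phase-2 re-rank
    private float score;      // final score from the comparison phase
    private int freq;         // document frequency of the suggestion

    public String getSurface() { return surface; }
    public void setSurface(String surface) { this.surface = surface; }
    public String getAnalyzed() { return analyzed; }
    public void setAnalyzed(String analyzed) { this.analyzed = analyzed; }
    public float getScore() { return score; }
    public void setScore(float score) { this.score = score; }
    public int getFreq() { return freq; }
    public void setFreq(int freq) { this.freq = freq; }
}
```

With proper accessors like these, surface+analyzed could be added without the 
backwards-compatibility dead end the current field-only class creates.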

                
> split off the spell check word and surface form in spell check dictionary
> -------------------------------------------------------------------------
>
>                 Key: LUCENE-3888
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3888
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/spellchecker
>            Reporter: Koji Sekiguchi
>            Assignee: Koji Sekiguchi
>            Priority: Minor
>             Fix For: 3.6, 4.0
>
>         Attachments: LUCENE-3888.patch, LUCENE-3888.patch, LUCENE-3888.patch, 
> LUCENE-3888.patch, LUCENE-3888.patch
>
>
> The "did you mean?" feature by using Lucene's spell checker cannot work well 
> for Japanese environment unfortunately and is the longstanding problem, 
> because the logic needs comparatively long text to check spells, but for some 
> languages (e.g. Japanese), most words are too short to use the spell checker.
> I think, for at least Japanese, the things can be improved if we split off 
> the spell check word and surface form in the spell check dictionary. Then we 
> can use ReadingAttribute for spell checking but CharTermAttribute for 
> suggesting, for example.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
