[ 
https://issues.apache.org/jira/browse/LUCENE-3888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13233291#comment-13233291
 ] 

Robert Muir commented on LUCENE-3888:
-------------------------------------

Koji: hmm I think the problem is not in the Dictionary interface (which is 
actually ok),
but instead in the spellcheckers and suggesters themselves?

For spellchecking, I think we need to expose more analysis options in 
Spellchecker: currently this is hardcoded to KeywordAnalyzer (it uses 
NOT_ANALYZED). Instead I think you should be able to pass an Analyzer: we 
would also have a TokenFilter for Japanese that replaces the term text with 
the reading from ReadingAttribute.
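Such a filter could be a minimal sketch like the following, assuming the 
kuromoji ReadingAttribute API; the class name ReadingFormFilter and the 
pass-through behavior for tokens without a reading are my own choices here, 
not existing code:

```java
import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.ja.tokenattributes.ReadingAttribute;

/**
 * Sketch: replaces each token's term text with its reading from
 * ReadingAttribute, so downstream consumers (e.g. the spellchecker's
 * n-gram analysis) operate on readings rather than surface forms.
 */
public final class ReadingFormFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final ReadingAttribute readingAtt = addAttribute(ReadingAttribute.class);

  public ReadingFormFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    String reading = readingAtt.getReading();
    if (reading != null) {
      // Tokens with no reading (e.g. unknown words) pass through unchanged:
      // an assumption on my part, not a requirement.
      termAtt.setEmpty().append(reading);
    }
    return true;
  }
}
```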

In the same way, suggest can analyze too (LUCENE-3842 is already some work 
toward that, especially with the idea of supporting Japanese in this exact 
same way).

So in short I think we should:
# create a TokenFilter (similar to BaseFormFilter) which copies the 
ReadingAttribute into the CharTermAttribute.
# refactor the 'n-gram analysis' in spellchecker to work on actual 
tokenstreams (the n-gram step itself can likely be implemented as a 
tokenstream too), allowing the user to set an Analyzer on Spellchecker to 
control how it analyzes text.
# continue to work on 'analysis for suggest' like LUCENE-3842.
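To make step 2 concrete: the 'n-gram analysis' currently hardcoded in 
Spellchecker amounts to chopping a word into fixed-size character grams. 
Here is a self-contained sketch of that idea (the method name formGrams is 
borrowed from the existing code, but this is not the actual Lucene source, 
and the short-word handling is my own assumption):

```java
/**
 * Self-contained sketch of the character n-gram split the spellchecker
 * performs on each word. Refactoring this into a tokenstream would let
 * the same logic run behind a user-supplied Analyzer.
 */
public class NGramSketch {
  static String[] formGrams(String text, int ng) {
    int len = text.length();
    if (len < ng) {
      // Word shorter than the gram size: nothing to index.
      return new String[0];
    }
    String[] res = new String[len - ng + 1];
    for (int i = 0; i < len - ng + 1; i++) {
      res[i] = text.substring(i, i + ng);
    }
    return res;
  }

  public static void main(String[] args) {
    // A short Japanese surface form yields very few grams, which is
    // exactly why longer reading-based forms help the spellchecker.
    for (String gram : formGrams("レディング", 2)) {
      System.out.println(gram);
    }
  }
}
```

Running the analysis on readings instead of surface forms simply means 
feeding this step longer strings, without changing the gram logic itself.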

Note that this use of analyzers in spellcheck/suggest is unrelated to Solr's 
current use of 'analyzers', which is only for some query manipulation and not 
very useful.

                
> split off the spell check word and surface form in spell check dictionary
> -------------------------------------------------------------------------
>
>                 Key: LUCENE-3888
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3888
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/spellchecker
>            Reporter: Koji Sekiguchi
>            Priority: Minor
>             Fix For: 3.6, 4.0
>
>         Attachments: LUCENE-3888.patch
>
>
> The "did you mean?" feature using Lucene's spell checker unfortunately does 
> not work well in a Japanese environment, and this is a longstanding 
> problem: the logic needs comparatively long text to check spelling, but in 
> some languages (e.g. Japanese) most words are too short for the spell 
> checker to work with.
> I think that, at least for Japanese, things can be improved if we split off 
> the spell check word and the surface form in the spell check dictionary. 
> Then we could use ReadingAttribute for spell checking but CharTermAttribute 
> for suggesting, for example.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
