[
https://issues.apache.org/jira/browse/LUCENE-3888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13233291#comment-13233291
]
Robert Muir commented on LUCENE-3888:
-------------------------------------
Koji: hmm, I think the problem is not in the Dictionary interface (which is
actually ok), but instead in the spellcheckers and suggesters themselves?
For spellchecking, I think we need to expose more analysis options in
Spellchecker: currently this is hardcoded to KeywordAnalyzer (it uses
NOT_ANALYZED). Instead I think you should be able to pass an Analyzer: we
would also have a TokenFilter for Japanese that replaces the term text with
the Reading from ReadingAttribute.
In the same way, suggest can analyze too (LUCENE-3842 is already some work
toward that, especially with the idea to support Japanese this exact same
way).
So in short I think we should:
# create a TokenFilter (similar to BaseFormFilter) which copies
ReadingAttribute into termAtt.
# refactor the 'n-gram analysis' in the spellchecker to work on actual
tokenstreams (this can also likely be implemented as tokenstreams), allowing
the user to set an Analyzer on Spellchecker to control how it analyzes text.
# continue to work on 'analysis for suggest' like LUCENE-3842.
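Step 1 could look roughly like this: a minimal sketch of a TokenFilter that
overwrites the term text with the reading, assuming Kuromoji's
ReadingAttribute from the Japanese analysis module. The class name
ReadingFormFilter is hypothetical, and the null check (tokens with no reading
are passed through unchanged) is one possible policy, not a settled design:

```java
import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.ja.tokenattributes.ReadingAttribute;

/**
 * Sketch: copies the reading (from ReadingAttribute, populated by
 * JapaneseTokenizer) into the term attribute, so downstream consumers
 * such as a spellchecker or suggester index readings instead of
 * surface forms. Hypothetical name; not an existing Lucene class.
 */
public final class ReadingFormFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final ReadingAttribute readingAtt = addAttribute(ReadingAttribute.class);

  public ReadingFormFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    String reading = readingAtt.getReading();
    if (reading != null) {
      // replace the surface form with its reading
      termAtt.setEmpty().append(reading);
    }
    return true;
  }
}
```

In a spellcheck analyzer this filter would sit directly after
JapaneseTokenizer in the chain, so the index carries readings while the
original surface forms can still be kept elsewhere for display.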
Note this use of analyzers in spellcheck/suggest is unrelated to Solr's
current use of 'analyzers', which is only for some query manipulation and not
very useful.
> split off the spell check word and surface form in spell check dictionary
> -------------------------------------------------------------------------
>
> Key: LUCENE-3888
> URL: https://issues.apache.org/jira/browse/LUCENE-3888
> Project: Lucene - Java
> Issue Type: Improvement
> Components: modules/spellchecker
> Reporter: Koji Sekiguchi
> Priority: Minor
> Fix For: 3.6, 4.0
>
> Attachments: LUCENE-3888.patch
>
>
> The "did you mean?" feature using Lucene's spell checker unfortunately does
> not work well in a Japanese environment, and this is a longstanding problem,
> because the logic needs comparatively long text to check spelling, but in
> some languages (e.g. Japanese) most words are too short for the spell
> checker to use.
> I think, at least for Japanese, things can be improved if we split off the
> spell check word and the surface form in the spell check dictionary. Then
> we can use ReadingAttribute for spell checking but CharTermAttribute for
> suggesting, for example.