[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries

Christian Moen (JIRA) Sat, 27 Apr 2013 21:48:19 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13643901#comment-13643901
 ]


Christian Moen commented on LUCENE-4956:
----------------------------------------

The Korean analyzer should be named 
{{org.apache.lucene.analysis.kr.KoreanAnalyzer}} and we'll provide a 
ready-to-use field type {{text_kr}} in {{schema.xml}} for Solr users, which is 
consistent with what we do for other languages.
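
To make the naming convention concrete, here is a minimal sketch in Java. The class {{AnalyzerNaming}} and its methods are hypothetical helpers for illustration only and are not part of Lucene or Solr. They just show how a language code and a language name map to the analyzer class name and the Solr field type name mentioned above.

```java
// Hypothetical helper, not part of Lucene or Solr: it only illustrates the naming scheme.
final class AnalyzerNaming {
    private AnalyzerNaming() {}

    /** Fully qualified analyzer class name for a language code and language name. */
    static String analyzerClass(String code, String language) {
        return "org.apache.lucene.analysis." + code + "." + language + "Analyzer";
    }

    /** Solr field type name for a language code. */
    static String fieldType(String code) {
        return "text_" + code;
    }

    public static void main(String[] args) {
        // Prints the two names discussed in this comment.
        System.out.println(analyzerClass("kr", "Korean"));
        System.out.println(fieldType("kr"));
    }
}
```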

As for where the analyzer code itself lives, I think it's fine to put it in 
{{lucene/analysis/arirang}}.  The file {{lucene/analysis/README.txt}} documents 
what these modules are and the code is easily and directly retrievable in IDEs 
by looking up {{KoreanAnalyzer}} (the source code paths will be set up by {{ant 
eclipse}} and {{ant idea}}).

One reason analyzers have not been put in {{lucene/analysis/common}} in the 
past is that they require dictionaries that are several megabytes in size.

Overall, I don't think the scheme we are using is all that problematic, but 
it's true that {{MorfologikAnalyzer}} and {{SmartChineseAnalyzer}} don't 
align with it.  The scheme doesn't easily lend itself to different 
implementations for one language.  That's not a common case today, although 
it might become more common in the future.

In the case of Norwegian (no), there are ISO language codes for both Bokmål 
(nb) and Nynorsk (nn), and one way of supporting this is also to consider these 
as options to {{NorwegianAnalyzer}} since both languages are Norwegian.  See 
SOLR-4565 for thoughts on how to extend support in 
{{NorwegianMinimalStemFilter}} for this.

A similar overall approach might make sense when there are multiple 
implementations for a language; end-users can use an analyzer named 
{{<Language>Analyzer}} without having to study the differences between the 
implementations first.  I also see problems with this, but it's just a 
thought...
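
As a rough sketch of that thought in Java: the class {{LanguageAnalyzers}}, the variant keys and the second implementation name below are all hypothetical and not an existing Lucene API. The idea shown is only that users ask for a language and get a default implementation unless they explicitly pick a variant.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not a Lucene API: one entry point per language, with the
// implementation chosen by an optional variant key.
final class LanguageAnalyzers {
    // language -> (variant -> implementation class name); insertion order is kept,
    // so the first variant registered for a language acts as its default.
    private final Map<String, Map<String, String>> registry = new LinkedHashMap<>();

    void register(String language, String variant, String implClass) {
        registry.computeIfAbsent(language, k -> new LinkedHashMap<>()).put(variant, implClass);
    }

    /** Resolves an implementation; a null variant selects the default. */
    String resolve(String language, String variant) {
        Map<String, String> variants = registry.get(language);
        if (variants == null || variants.isEmpty()) {
            throw new IllegalArgumentException("no analyzer registered for " + language);
        }
        if (variant == null) {
            return variants.values().iterator().next();
        }
        String impl = variants.get(variant);
        if (impl == null) {
            throw new IllegalArgumentException("unknown variant " + variant + " for " + language);
        }
        return impl;
    }

    public static void main(String[] args) {
        LanguageAnalyzers analyzers = new LanguageAnalyzers();
        analyzers.register("Korean", "arirang", "org.apache.lucene.analysis.kr.KoreanAnalyzer");
        // Assumed second implementation, purely for illustration.
        analyzers.register("Korean", "other", "example.OtherKoreanAnalyzer");
        System.out.println(analyzers.resolve("Korean", null));
        System.out.println(analyzers.resolve("Korean", "other"));
    }
}
```

The obvious cost of this design is that the default hides real differences between implementations, which is one of the problems alluded to above.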

I'm all for improving our scheme, but perhaps we can open up a separate JIRA 
for this and keep this one focused on Korean?




                
> the korean analyzer that has a korean morphological analyzer and dictionaries
> -----------------------------------------------------------------------------
>
>                 Key: LUCENE-4956
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4956
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/analysis
>    Affects Versions: 4.2
>            Reporter: SooMyung Lee
>              Labels: newbie
>         Attachments: kr.analyzer.4x.tar
>
>
> The Korean language has specific characteristics. When developing a search 
> service with Lucene & Solr in Korean, there are some problems in searching and 
> indexing. The Korean analyzer solves these problems with a Korean morphological 
> analyzer. It consists of a Korean morphological analyzer, dictionaries, a 
> Korean tokenizer and a Korean filter. The Korean analyzer is made for Lucene 
> and Solr. If you develop a search service with Lucene in Korean, it is the 
> best idea to choose the Korean analyzer.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
