[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries

Uwe Schindler (JIRA) Fri, 18 Oct 2013 01:12:32 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13798894#comment-13798894
 ]


Uwe Schindler commented on LUCENE-4956:
---------------------------------------

Hi,
I reviewed the Trie.java code and its usage yesterday. Trie.java is only used 
in 2 places, both with the same usage pattern:
- DictionaryUtil#dictionary
- Tagger#occurences

In both cases there are only 2 types of matches (prefix and exact), exposed 
through 3 methods:
- DictionaryUtil#findWithPrefix: returns an Iterator of all entries with a 
given prefix
- DictionaryUtil#getWord: returns WordEntry for an exact match
- Tagger#getGR: returns iterator of all entries with a given prefix

These use cases are not really the ones a Trie is made for, so the ideal and 
most performant solution would be to use Lucene's FST implementation. We would 
also get an Iterator-like interface to look up prefixes. So I would suggest 
replacing these 3 methods with an FST backing them. The dictionary would then 
(as for kuromoji) be preprocessed and saved as a serialized FST in the resource 
file. The original dictionary as a text file would only be available in the 
Lucene source distribution, to regenerate the FST.
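
To make the two access patterns concrete, here is a minimal sketch of the 
contract that any replacement (FST or otherwise) would have to satisfy: one 
exact-match lookup and one prefix enumeration over sorted keys. It 
deliberately uses java.util.TreeMap as a stand-in, not Lucene's FST API; the 
class name PrefixDictionarySketch, the String payload and the sample words are 
assumptions for illustration only and do not come from the patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Hypothetical stand-in for the dictionary lookups; not the actual DictionaryUtil code. */
final class PrefixDictionarySketch {
    // Keys are kept in sorted order, so all keys sharing a prefix are contiguous.
    private final NavigableMap<String, String> entries = new TreeMap<>();

    /** Adds one dictionary entry; the String payload is a placeholder for WordEntry. */
    void add(String word, String features) {
        entries.put(word, features);
    }

    /** Exact match (the getWord pattern). Returns null if the word is absent. */
    String getWord(String word) {
        return entries.get(word);
    }

    /** All entries whose key starts with the prefix (the findWithPrefix / getGR pattern). */
    List<Map.Entry<String, String>> findWithPrefix(String prefix) {
        List<Map.Entry<String, String>> result = new ArrayList<>();
        // Seek to the first key >= prefix, then scan until a key no longer shares it.
        for (Map.Entry<String, String> e : entries.tailMap(prefix, true).entrySet()) {
            if (!e.getKey().startsWith(prefix)) {
                break;
            }
            result.add(e);
        }
        return result;
    }

    public static void main(String[] args) {
        PrefixDictionarySketch dict = new PrefixDictionarySketch();
        dict.add("search", "N");
        dict.add("searcher", "N");
        dict.add("seat", "N");
        dict.add("solr", "N");

        System.out.println(dict.getWord("seat"));
        for (Map.Entry<String, String> e : dict.findWithPrefix("sea")) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

Both operations are a seek followed by an in-order scan, which is the same 
shape of access that a sorted, enumerable structure such as an FST provides.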

> the korean analyzer that has a korean morphological analyzer and dictionaries
> -----------------------------------------------------------------------------
>
>                 Key: LUCENE-4956
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4956
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/analysis
>    Affects Versions: 4.2
>            Reporter: SooMyung Lee
>            Assignee: Christian Moen
>              Labels: newbie
>         Attachments: eval.patch, kr.analyzer.4x.tar, lucene-4956.patch, 
> lucene4956.patch, LUCENE-4956.patch
>
>
> The Korean language has specific characteristics. When developing a search 
> service with Lucene & Solr in Korean, there are some problems in searching 
> and indexing. The Korean analyzer solves these problems with a Korean 
> morphological analyzer. It consists of a Korean morphological analyzer, 
> dictionaries, a Korean tokenizer and a Korean filter. The Korean analyzer is 
> made for Lucene and Solr. If you develop a search service with Lucene in 
> Korean, it is the best idea to choose the Korean analyzer.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

