[
https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12707649#action_12707649
]
Michael McCandless commented on LUCENE-1629:
--------------------------------------------
Xiaoping, could you turn TestSmartChineseAnalyzer into a real JUnit test
case? (I.e., invoke that sample method from the testChineseAnalyzer method.)
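Something like this, roughly (a minimal sketch: I'm assuming the analyzer
class is SmartChineseAnalyzer and the Lucene 2.4-era TokenStream API, so
adjust to whatever the patch actually exposes):
{code}
import java.io.StringReader;
import junit.framework.TestCase;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

public class TestSmartChineseAnalyzer extends TestCase {
  public void testChineseAnalyzer() throws Exception {
    Analyzer analyzer = new SmartChineseAnalyzer();
    TokenStream ts = analyzer.tokenStream("field", new StringReader("我是中国人"));
    // Expect the sentence to be segmented into real words, not bigrams:
    String[] expected = {"我", "是", "中国人"};
    for (String term : expected) {
      Token token = ts.next();
      assertNotNull(token);
      assertEquals(term, token.term());
    }
    assertNull(ts.next()); // no trailing tokens
  }
}
{code}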
Also, it looks like you didn't switch to Class.getResourceAsStream() (Uwe's
suggestion above) -- are you planning on doing that?
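I.e., something along these lines, so the data files are found on the
classpath regardless of the working directory (a self-contained sketch; the
class and resource names here are illustrative, not taken from the patch):
{code}
import java.io.IOException;
import java.io.InputStream;

class DictionaryLoader {
  static InputStream openCoreDict() throws IOException {
    // Resolves relative to this class's package on the classpath,
    // independent of the process working directory.
    InputStream in = DictionaryLoader.class.getResourceAsStream("coredict.mem");
    if (in == null) {
      throw new IOException("coredict.mem not found on classpath");
    }
    return in;
  }
}
{code}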
Finally, Robert asked a question above (about Big5) that maybe you missed?
bq. Do we compile the source files with a fixed encoding of UTF-8 (build.xml?).
If not, there may be problems if the Java compiler uses another encoding
(because of the platform default).
Lucene's common-build.xml already sets the encoding (for javac) to utf-8. So I
think we're good here...
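For reference, the javac setup there looks roughly like this (paraphrased,
not copied verbatim from common-build.xml):
{code}
<javac srcdir="src/java"
       destdir="${build.dir}/classes"
       encoding="utf-8"
       source="${javac.source}"
       target="${javac.target}"/>
{code}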
> contrib intelligent Analyzer for Chinese
> ----------------------------------------
>
> Key: LUCENE-1629
> URL: https://issues.apache.org/jira/browse/LUCENE-1629
> Project: Lucene - Java
> Issue Type: Improvement
> Components: contrib/analyzers
> Affects Versions: 2.4.1
> Environment: for java 1.5 or higher, lucene 2.4.1
> Reporter: Xiaoping Gao
> Assignee: Michael McCandless
> Fix For: 2.9
>
> Attachments: analysis-data.zip, LUCENE-1629-java1.4.patch,
> LUCENE-1629.patch
>
>
> I wrote an Analyzer for Apache Lucene that analyzes sentences in the Chinese
> language. It's called "imdict-chinese-analyzer"; the project on Google Code
> is here: http://code.google.com/p/imdict-chinese-analyzer/
> In Chinese, "我是中国人" (I am Chinese) should be tokenized as "我" (I) "是" (am)
> "中国人" (Chinese), not "我" "是中" "国人". So the analyzer must handle each
> sentence properly, or there will be misunderstandings everywhere in the index
> constructed by Lucene, and the accuracy of the search engine will be
> seriously affected!
> Although there are two analyzer packages in the Apache repository that can
> handle Chinese, ChineseAnalyzer and CJKAnalyzer, they take each single
> character or every two adjoining characters as a word. This is obviously not
> how Chinese actually works, and this strategy also inflates the index size
> and badly hurts performance.
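> To make the contrast concrete, here is roughly what CJKAnalyzer produces for
> the sentence above (a sketch against the Lucene 2.4 contrib API):
> {code}
> import java.io.StringReader;
> import org.apache.lucene.analysis.Token;
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.cjk.CJKAnalyzer;
>
> public class BigramDemo {
>   public static void main(String[] args) throws Exception {
>     // CJKAnalyzer emits overlapping bigrams: 我是, 是中, 中国, 国人.
>     // ChineseAnalyzer would emit single characters: 我, 是, 中, 国, 人.
>     TokenStream ts = new CJKAnalyzer().tokenStream("f", new StringReader("我是中国人"));
>     for (Token t = ts.next(); t != null; t = ts.next()) {
>       System.out.println(t.term());
>     }
>   }
> }
> {code}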
> The algorithm of imdict-chinese-analyzer is based on a Hidden Markov Model
> (HMM), so it can tokenize Chinese sentences in a really intelligent way. The
> tokenization accuracy of this model is above 90% according to the paper
> "HHMM-based Chinese Lexical Analyzer ICTCLAS", while the other analyzers'
> accuracy is about 60%.
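> In textbook terms (my formulation, not quoted from the paper), the segmenter
> picks the word sequence W whose characters spell out the sentence S and
> which maximizes the model probability, decoded with the Viterbi algorithm:
> {noformat}
> \hat{W} = \arg\max_{W : chars(W) = S} P(W)
>         \approx \arg\max_{W} \prod_{i=1}^{n} P(w_i | w_{i-1})
> {noformat}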
> As imdict-chinese-analyzer is really fast and intelligent, I want to
> contribute it to the Apache Lucene repository.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.