[ https://issues.apache.org/jira/browse/LUCENE-3305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184206#comment-13184206 ]
Christian Moen commented on LUCENE-3305:
----------------------------------------
The middle dot character (nakaguro) is treated as character class SYMBOL in
order to provoke a split. This is by design and we override IPADIC in this
case since we feel the split behaviour is more reasonable for most applications.
Having said this, I'd expect input
{noformat}
私がエドガー・ドガです。
{noformat}
to produce segmentation
{noformat}
私 が エドガー ・ ドガ です 。
{noformat}
The middle dot ・ seems to have been removed in your case. Are you deliberately
removing it somewhere?
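To make the expected behaviour concrete, here is a toy sketch of the split around the middle dot. It is only an illustration: the class `NakaguroSplitSketch` and its `split` method are hypothetical names, and the code does not reflect how Kuromoji actually segments text. It simply emits U+30FB as a token of its own, the way `エドガー・ドガ` is expected to come out as `エドガー ・ ドガ`. The strings use Unicode escapes so that the source compiles regardless of file encoding.

```java
import java.util.ArrayList;
import java.util.List;

final class NakaguroSplitSketch {
    // U+30FB KATAKANA MIDDLE DOT (nakaguro).
    static final char NAKAGURO = '\u30FB';

    /** Splits text at every middle dot and keeps the dot as a separate token. */
    static List<String> split(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == NAKAGURO) {
                // Flush whatever preceded the dot, then emit the dot itself.
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
                tokens.add(String.valueOf(c));
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Katakana "edoga-", middle dot, "doga" (the name from the example input).
        List<String> tokens = split("\u30A8\u30C9\u30AC\u30FC\u30FB\u30C9\u30AC");
        System.out.println(tokens.size() + " tokens: " + tokens);
    }
}
```

If the middle dot is missing from the output, something downstream of segmentation (for example a filter) is probably dropping it.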
You're right about the NFKC normalization. It's turned off by default in
Kuromoji on GitHub. I think disabling it is a reasonable default, but it
would be good to have the option of applying NFKC normalization prior to
segmentation in the Tokenizer/Analyzer (Lucene).
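As a rough sketch of what optional NFKC normalization before segmentation could look like, the snippet below uses the JDK's `java.text.Normalizer`. The class name `NfkcBeforeSegmentation` and the idea of calling `normalize` on the input before handing it to a tokenizer are assumptions for illustration, not the Kuromoji or Lucene API. Under NFKC, half-width katakana such as `ｴﾄﾞｶﾞｰ･ﾄﾞｶﾞ` become their full-width forms `エドガー・ドガ`, and the half-width middle dot U+FF65 becomes U+30FB.

```java
import java.text.Normalizer;

final class NfkcBeforeSegmentation {
    /** Applies Unicode NFKC normalization; intended to run before the text reaches a tokenizer. */
    static String normalize(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // Half-width katakana "edoga-", half-width middle dot, "doga".
        String halfWidth =
            "\uFF74\uFF84\uFF9E\uFF76\uFF9E\uFF70\uFF65\uFF84\uFF9E\uFF76\uFF9E";
        String normalized = normalize(halfWidth);
        // Print code points so the result is readable regardless of console encoding.
        normalized.chars().forEach(c -> System.out.printf("U+%04X ", c));
        System.out.println();
    }
}
```

Keeping this step optional matches the default discussed above: callers who want the original characters preserved can skip it.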
> Kuromoji code donation - a new Japanese morphological analyzer
> --------------------------------------------------------------
>
> Key: LUCENE-3305
> URL: https://issues.apache.org/jira/browse/LUCENE-3305
> Project: Lucene - Java
> Issue Type: New Feature
> Components: modules/analysis
> Reporter: Christian Moen
> Assignee: Simon Willnauer
> Fix For: 4.0
>
> Attachments: Kuromoji short overview .pdf, LUCENE-3305.patch,
> ip-clearance-Kuromoji.xml, ip-clearance-Kuromoji.xml,
> kuromoji-0.7.6-asf.tar.gz, kuromoji-0.7.6.tar.gz,
> kuromoji-solr-0.5.3-asf.tar.gz, kuromoji-solr-0.5.3.tar.gz, wordid0.patch
>
>
> Atilika Inc. (アティリカ株式会社) would like to donate the Kuromoji Japanese
> morphological analyzer to the Apache Software Foundation in the hope that it
> will be useful to Lucene and Solr users in Japan and elsewhere.
> The project was started in 2010 since we couldn't find any high-quality,
> actively maintained and easy-to-use Java-based Japanese morphological
> analyzers, and these became many of our design goals for Kuromoji.
> Kuromoji also has a segmentation mode that is particularly useful for search,
> which we hope will interest Lucene and Solr users. Compound nouns, such as
> 関西国際空港 (Kansai International Airport) and 日本経済新聞 (Nikkei Newspaper), are
> segmented as one token with most analyzers. As a result, a search for 空港
> (airport) or 新聞 (newspaper) will not give you a hit for these words. Kuromoji
> can segment these words into 関西 国際 空港 and 日本 経済 新聞, which is generally what
> you would want for search and you'll get a hit.
> We also wanted to make sure the technology has a license that makes it
> compatible with other Apache Software Foundation software to maximize its
> usefulness. Kuromoji has an Apache License 2.0 and all code is currently
> owned by Atilika Inc. The software has been developed by my good friend and
> ex-colleague Masaru Hasegawa and myself.
> Kuromoji uses the so-called IPADIC for its dictionary/statistical model and
> its license terms are described in NOTICE.txt.
> I'll upload code distributions and their corresponding hashes and I'd very
> much like to start the code grant process. I'm also happy to provide patches
> to integrate Kuromoji into the codebase, if you prefer that.
> Please advise on how you'd like me to proceed with this. Thank you.
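The search argument in the issue description above can be sketched with a toy inverted index. This is a minimal illustration under stated assumptions: `CompoundSearchSketch`, `index` and `hits` are hypothetical names, the token lists are written by hand rather than produced by Kuromoji, and lookup is exact term matching. A document indexed as the single token `関西国際空港` gives no hit for `空港`, while the same document indexed as `関西 国際 空港` does.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class CompoundSearchSketch {
    // Hand-written tokens for "Kansai International Airport".
    static final String KANSAI = "\u95A2\u897F";        // Kansai
    static final String KOKUSAI = "\u56FD\u969B";       // international
    static final String KUKO = "\u7A7A\u6E2F";          // airport
    static final String COMPOUND = KANSAI + KOKUSAI + KUKO;

    /** Builds a toy inverted index: token -> ids of the documents containing it. */
    static Map<String, Set<Integer>> index(List<List<String>> tokenizedDocs) {
        Map<String, Set<Integer>> postings = new HashMap<>();
        for (int docId = 0; docId < tokenizedDocs.size(); docId++) {
            for (String token : tokenizedDocs.get(docId)) {
                postings.computeIfAbsent(token, k -> new HashSet<>()).add(docId);
            }
        }
        return postings;
    }

    /** Exact term lookup, as a plain term query would do. */
    static boolean hits(Map<String, Set<Integer>> postings, String term) {
        return postings.containsKey(term);
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> oneToken =
            index(Arrays.asList(Arrays.asList(COMPOUND)));
        Map<String, Set<Integer>> decompounded =
            index(Arrays.asList(Arrays.asList(KANSAI, KOKUSAI, KUKO)));
        System.out.println("single token, hit for airport: " + hits(oneToken, KUKO));
        System.out.println("decompounded, hit for airport: " + hits(decompounded, KUKO));
    }
}
```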