[jira] [Commented] (LUCENE-3305) Kuromoji code donation - a new Japanese morphological analyzer

2017-07-22 Thread Shad Storhaug (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16097465#comment-16097465
 ] 

Shad Storhaug commented on LUCENE-3305:
---

I am working on porting this code over to .NET (LUCENENET-567). All is good 
with the Analyzer, Filters, etc. since they have good tests.

However, there are only 2 tests for the "tools", and neither one tests the 
{{DictionaryBuilder.Main(String[] args)}} method (or runs any of the I/O code). 
Documentation is scant, and it is difficult to work out what should be in the 
input directory in order to do an end-to-end test of this tool.

Could you add some tests so we have better code coverage, and so I can verify 
it all works after the translation to .NET? Or at least provide a zip file with 
the files that are required as input to the tool?



> Kuromoji code donation - a new Japanese morphological analyzer
> --
>
> Key: LUCENE-3305
> URL: https://issues.apache.org/jira/browse/LUCENE-3305
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Reporter: Christian Moen
>Assignee: Robert Muir
> Fix For: 3.6, 4.0-ALPHA
>
> Attachments: ip-clearance-Kuromoji.xml, ip-clearance-Kuromoji.xml, 
> kuromoji-0.7.6-asf.tar.gz, kuromoji-0.7.6.tar.gz, Kuromoji short overview 
> .pdf, kuromoji-solr-0.5.3-asf.tar.gz, kuromoji-solr-0.5.3.tar.gz, 
> LUCENE-3305.patch, LUCENE-3305.patch, wordid0.patch
>
>
> Atilika Inc. (アティリカ株式会社) would like to donate the Kuromoji Japanese 
> morphological analyzer to the Apache Software Foundation in the hope that it 
> will be useful to Lucene and Solr users in Japan and elsewhere.
> The project was started in 2010 since we couldn't find any high-quality, 
> actively maintained and easy-to-use Java-based Japanese morphological 
> analyzers, and these qualities became design goals for Kuromoji.
> Kuromoji also has a segmentation mode that is particularly useful for search, 
> which we hope will interest Lucene and Solr users.  Compound nouns, such as 
> 関西国際空港 (Kansai International Airport) and 日本経済新聞 (Nikkei Newspaper), are 
> segmented as one token with most analyzers.  As a result, a search for 空港 
> (airport) or 新聞 (newspaper) will not give you a hit for these words.  Kuromoji 
> can segment these words into 関西 国際 空港 and 日本 経済 新聞, which is generally what 
> you would want for search, and you'll get a hit.
> We also wanted to make sure the technology has a license that makes it 
> compatible with other Apache Software Foundation software to maximize its 
> usefulness.  Kuromoji has an Apache License 2.0 and all code is currently 
> owned by Atilika Inc.  The software has been developed by my good friend and 
> ex-colleague Masaru Hasegawa and myself.
> Kuromoji uses the so-called IPADIC for its dictionary/statistical model and 
> its license terms are described in NOTICE.txt.
> I'll upload code distributions and their corresponding hashes and I'd very 
> much like to start the code grant process.  I'm also happy to provide patches 
> to integrate Kuromoji into the codebase, if you prefer that.
> Please advise on how you'd like me to proceed with this.  Thank you.
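
The search effect described in the issue text can be illustrated with a toy inverted index. This is only a sketch: the token lists are written by hand from the segmentations quoted above rather than produced by Kuromoji, and {{ToyIndex}} is a hypothetical class, not part of Lucene.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical toy inverted index; not part of Lucene or Kuromoji. */
final class ToyIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    /** Indexes a document given its already-segmented tokens. */
    void add(int docId, List<String> tokens) {
        for (String token : tokens) {
            postings.computeIfAbsent(token, k -> new HashSet<>()).add(docId);
        }
    }

    /** Returns true if at least one indexed document contains the exact term. */
    boolean hits(String term) {
        return postings.containsKey(term);
    }

    public static void main(String[] args) {
        // Compound noun kept as a single token, as most analyzers would do.
        ToyIndex whole = new ToyIndex();
        whole.add(1, Arrays.asList("関西国際空港"));

        // Compound noun split into its parts, as in the search segmentation mode.
        ToyIndex split = new ToyIndex();
        split.add(1, Arrays.asList("関西", "国際", "空港"));

        System.out.println(whole.hits("空港")); // false
        System.out.println(split.hits("空港")); // true
    }
}
```

With exact term matching, the query term only hits when it was indexed as its own token, which is why the split form is usually preferable for search.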



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3305) Kuromoji code donation - a new Japanese morphological analyzer

2012-01-15 Thread Christian Moen (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13186514#comment-13186514
 ] 

Christian Moen commented on LUCENE-3305:


Thanks for the excellent work integrating Kuromoji, Robert.  Thanks also to 
everybody who has helped make this happen. 

> Kuromoji code donation - a new Japanese morphological analyzer
> --
>
> Key: LUCENE-3305
> URL: https://issues.apache.org/jira/browse/LUCENE-3305
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: modules/analysis
>Reporter: Christian Moen
>Assignee: Robert Muir
> Fix For: 3.6, 4.0
>
> Attachments: Kuromoji short overview .pdf, LUCENE-3305.patch, 
> LUCENE-3305.patch, ip-clearance-Kuromoji.xml, ip-clearance-Kuromoji.xml, 
> kuromoji-0.7.6-asf.tar.gz, kuromoji-0.7.6.tar.gz, 
> kuromoji-solr-0.5.3-asf.tar.gz, kuromoji-solr-0.5.3.tar.gz, wordid0.patch

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira






[jira] [Commented] (LUCENE-3305) Kuromoji code donation - a new Japanese morphological analyzer

2012-01-12 Thread Robert Muir (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13185222#comment-13185222
 ] 

Robert Muir commented on LUCENE-3305:
-

Yes, thanks also to Uwe for lots of work compressing data and refactoring, and 
to Mike for tuning the FSTs.

> Kuromoji code donation - a new Japanese morphological analyzer
> --
>
> Key: LUCENE-3305
> URL: https://issues.apache.org/jira/browse/LUCENE-3305
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: modules/analysis
>Reporter: Christian Moen
>Assignee: Robert Muir
> Fix For: 4.0
>
> Attachments: Kuromoji short overview .pdf, LUCENE-3305.patch, 
> LUCENE-3305.patch, ip-clearance-Kuromoji.xml, ip-clearance-Kuromoji.xml, 
> kuromoji-0.7.6-asf.tar.gz, kuromoji-0.7.6.tar.gz, 
> kuromoji-solr-0.5.3-asf.tar.gz, kuromoji-solr-0.5.3.tar.gz, wordid0.patch



[jira] [Commented] (LUCENE-3305) Kuromoji code donation - a new Japanese morphological analyzer

2012-01-12 Thread Simon Willnauer (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13185215#comment-13185215
 ] 

Simon Willnauer commented on LUCENE-3305:
-

bq. I committed this to trunk.

YAY! thanks everyone!

> Kuromoji code donation - a new Japanese morphological analyzer
> --
>
> Key: LUCENE-3305
> URL: https://issues.apache.org/jira/browse/LUCENE-3305
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: modules/analysis
>Reporter: Christian Moen
>Assignee: Robert Muir
> Fix For: 4.0
>
> Attachments: Kuromoji short overview .pdf, LUCENE-3305.patch, 
> LUCENE-3305.patch, ip-clearance-Kuromoji.xml, ip-clearance-Kuromoji.xml, 
> kuromoji-0.7.6-asf.tar.gz, kuromoji-0.7.6.tar.gz, 
> kuromoji-solr-0.5.3-asf.tar.gz, kuromoji-solr-0.5.3.tar.gz, wordid0.patch



[jira] [Commented] (LUCENE-3305) Kuromoji code donation - a new Japanese morphological analyzer

2012-01-12 Thread Robert Muir (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13185199#comment-13185199
 ] 

Robert Muir commented on LUCENE-3305:
-

I committed this to trunk. 

I'll let Hudson chew on it a bit before backporting to branch 3.x, but in 
general I think we've hammered on this enough that it's ready to be backported 
too.

It's a big contribution, so I'm sure minor things might pop up, but we can just 
open new issues...

Big thanks to Christian for the contribution... this is awesome!

> Kuromoji code donation - a new Japanese morphological analyzer
> --
>
> Key: LUCENE-3305
> URL: https://issues.apache.org/jira/browse/LUCENE-3305
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: modules/analysis
>Reporter: Christian Moen
>Assignee: Robert Muir
> Fix For: 4.0
>
> Attachments: Kuromoji short overview .pdf, LUCENE-3305.patch, 
> LUCENE-3305.patch, ip-clearance-Kuromoji.xml, ip-clearance-Kuromoji.xml, 
> kuromoji-0.7.6-asf.tar.gz, kuromoji-0.7.6.tar.gz, 
> kuromoji-solr-0.5.3-asf.tar.gz, kuromoji-solr-0.5.3.tar.gz, wordid0.patch



[jira] [Commented] (LUCENE-3305) Kuromoji code donation - a new Japanese morphological analyzer

2012-01-11 Thread Robert Muir (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184243#comment-13184243
 ] 

Robert Muir commented on LUCENE-3305:
-

{quote}
The middle dot ・ seems to have been removed in your case. Are you deliberately 
removing it somewhere?
{quote}

Just in my debugging :)

(Separately: I did add an option to doTokenize to not emit punctuation tokens, 
and the Lucene analyzer uses it by default; otherwise index size and searches 
are affected by many tokens like "。"... but that's unrelated here)
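
A rough sketch of what dropping punctuation tokens could look like, assuming a token is "punctuation" when every code point in it belongs to a Unicode punctuation category. The class and method names are hypothetical, and the exact set of categories the committed code checks is not shown in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper; not the Lucene or Kuromoji API. */
final class PunctuationDropper {

    /** Returns true if the token is non-empty and consists only of punctuation code points. */
    static boolean isPunctuation(String token) {
        return !token.isEmpty()
            && token.codePoints().allMatch(PunctuationDropper::isPunctuationCodePoint);
    }

    /** Assumption: only the Unicode punctuation general categories count as punctuation. */
    private static boolean isPunctuationCodePoint(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.CONNECTOR_PUNCTUATION:
            case Character.DASH_PUNCTUATION:
            case Character.START_PUNCTUATION:
            case Character.END_PUNCTUATION:
            case Character.INITIAL_QUOTE_PUNCTUATION:
            case Character.FINAL_QUOTE_PUNCTUATION:
            case Character.OTHER_PUNCTUATION:
                return true;
            default:
                return false;
        }
    }

    /** Returns the tokens in order, without the punctuation-only ones. */
    static List<String> dropPunctuation(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!isPunctuation(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("私", "が", "エドガー", "・", "ドガ", "です", "。");
        System.out.println(dropPunctuation(tokens)); // [私, が, エドガー, ドガ, です]
    }
}
```

Both the middle dot (U+30FB) and the ideographic full stop (U+3002) are in the "other punctuation" category, so they are filtered, while the prolonged sound mark in エドガー is a letter and is kept.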

{quote}
You're right about the NFKC-normalization. It's turned off by default in the 
Kuromoji on Github. I think disabling this is a reasonable default, but I think 
it's a good idea to have the option of doing NFKC-normalization prior to 
segmentation in the Tokenizer/Analyzer (Lucene).
{quote}

Yeah, I agree; we can add a CharFilter that uses the incremental normalization 
API.


> Kuromoji code donation - a new Japanese morphological analyzer
> --
>
> Key: LUCENE-3305
> URL: https://issues.apache.org/jira/browse/LUCENE-3305
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: modules/analysis
>Reporter: Christian Moen
>Assignee: Simon Willnauer
> Fix For: 4.0
>
> Attachments: Kuromoji short overview .pdf, LUCENE-3305.patch, 
> ip-clearance-Kuromoji.xml, ip-clearance-Kuromoji.xml, 
> kuromoji-0.7.6-asf.tar.gz, kuromoji-0.7.6.tar.gz, 
> kuromoji-solr-0.5.3-asf.tar.gz, kuromoji-solr-0.5.3.tar.gz, wordid0.patch



[jira] [Commented] (LUCENE-3305) Kuromoji code donation - a new Japanese morphological analyzer

2012-01-11 Thread Christian Moen (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184206#comment-13184206
 ] 

Christian Moen commented on LUCENE-3305:


The middle dot character (nakaguro) is treated as character class SYMBOL in 
order to provoke a split.  This is by design and we override IPADIC in this 
case since we feel the split behaviour is more reasonable for most applications.

Having said this, I'd expect input

{noformat}
私がエドガー・ドガです。
{noformat}

to produce segmentation
{noformat}
私 が エドガー ・ ドガ です 。
{noformat}

The middle dot ・ seems to have been removed in your case.  Are you deliberately 
removing it somewhere?
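
To make the expected behaviour concrete, here is a minimal sketch that forces a boundary at every middle dot and emits the dot as its own token. It only illustrates the split itself; it is not how Kuromoji's character-class handling is implemented, and {{MiddleDotSplitter}} is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of splitting at the katakana middle dot (nakaguro). */
final class MiddleDotSplitter {
    /** KATAKANA MIDDLE DOT, U+30FB. */
    static final char MIDDLE_DOT = '\u30FB';

    /** Splits text at each middle dot and keeps the dot as a separate token. */
    static List<String> split(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (c == MIDDLE_DOT) {
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
                tokens.add(String.valueOf(c));
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(split("エドガー・ドガ")); // [エドガー, ・, ドガ]
    }
}
```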


You're right about the NFKC-normalization.  It's turned off by default in the 
Kuromoji on Github.  I think disabling this is a reasonable default, but I 
think it's a good idea to have the option of doing NFKC-normalization prior to 
segmentation in the Tokenizer/Analyzer (Lucene).
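
As a sketch of what NFKC-normalization prior to segmentation means, the JDK's {{java.text.Normalizer}} can fold full-width forms to their compatibility equivalents. This normalizes a whole string at once, which is simpler than an incremental filter inside the analysis chain; {{NfkcPreprocessor}} is a hypothetical name.

```java
import java.text.Normalizer;

/** Hypothetical pre-processing step; normalizes the whole input before segmentation. */
final class NfkcPreprocessor {

    /** Applies Unicode NFKC normalization to the input text. */
    static String normalize(String input) {
        return Normalizer.normalize(input, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // Full-width Latin letters and digits fold to their ASCII counterparts.
        System.out.println(normalize("Ｗｉｎｄｏｗｓ９５")); // Windows95
    }
}
```

Note that normalization can change the length of the text, so a real filter would also need to correct token offsets, which this sketch does not attempt.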


> Kuromoji code donation - a new Japanese morphological analyzer
> --
>
> Key: LUCENE-3305
> URL: https://issues.apache.org/jira/browse/LUCENE-3305
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: modules/analysis
>Reporter: Christian Moen
>Assignee: Simon Willnauer
> Fix For: 4.0
>
> Attachments: Kuromoji short overview .pdf, LUCENE-3305.patch, 
> ip-clearance-Kuromoji.xml, ip-clearance-Kuromoji.xml, 
> kuromoji-0.7.6-asf.tar.gz, kuromoji-0.7.6.tar.gz, 
> kuromoji-solr-0.5.3-asf.tar.gz, kuromoji-solr-0.5.3.tar.gz, wordid0.patch



[jira] [Commented] (LUCENE-3305) Kuromoji code donation - a new Japanese morphological analyzer

2012-01-11 Thread Robert Muir (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184186#comment-13184186
 ] 

Robert Muir commented on LUCENE-3305:
-

Thank you for fixing that bug!

By the way, I've been reviewing the differences between mecab and kuromoji. In 
general the differences seem fine to me, 
actually in Kuromoji's favor (at least for search). Most revolve around 
middle-dot:

{noformat}
sentence: 私がエドガー・ドガです。
mecab: [私, が, エドガー・ドガ, です]
kuromoji: [私, が, エドガー, ドガ, です]
{noformat}

So I think these are improvements, at least for search (e.g. Kuromoji splits 
the first/last name here).

But there is often funkiness caused by the normalizeEntries option: if an 
entry is not NFKC-normalized, it adds an NFKC-normalized entry with the same 
costs etc. 

However, I think in some cases this skews the costs because e.g. half-width and 
full-width numbers have different costs.
So by adding normalized entries with the full-width cost, we sometimes get 
worse tokenization.

{noformat}
sentence: Windows95対応のゲームを動かしたいのです。
mecab: [Windows, 95, 対応, の, ゲーム, を, 動かし, たい, の, です]
kuromoji: [Windows, 9, 5, 対応, の, ゲーム, を, 動かし, たい, の, です]
{noformat}

Locally I changed the default of 'normalizeEntries' to false and it seemed to 
totally fix this; all the differences vs. mecab then seemed positive. 

I think we should disable normalizeEntries by default so that no costs are 
potentially skewed... opinions?
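
The concern can be sketched with a toy entry list. Everything here is an assumption for illustration: the {{Entry}} class, the {{build}} method and the cost values are made up, and only the rule "add an NFKC-normalized copy with the same cost" is taken from the description above.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of the normalizeEntries behaviour; costs are invented. */
final class NormalizeEntriesSketch {

    /** A dictionary entry reduced to its surface form and word cost. */
    static final class Entry {
        final String surface;
        final int cost;

        Entry(String surface, int cost) {
            this.surface = surface;
            this.cost = cost;
        }
    }

    /** Copies the entries; optionally appends NFKC-normalized copies with the same cost. */
    static List<Entry> build(List<Entry> source, boolean normalizeEntries) {
        List<Entry> result = new ArrayList<>(source);
        if (normalizeEntries) {
            for (Entry entry : source) {
                String normalized = Normalizer.normalize(entry.surface, Normalizer.Form.NFKC);
                if (!normalized.equals(entry.surface)) {
                    result.add(new Entry(normalized, entry.cost));
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up costs: the full-width digit is cheaper than the half-width one.
        List<Entry> source = Arrays.asList(new Entry("９", 100), new Entry("9", 500));
        for (Entry entry : build(source, true)) {
            System.out.println(entry.surface + " cost=" + entry.cost);
        }
    }
}
```

With the flag on, the half-width digit ends up with a second entry carrying the full-width cost, which is the kind of skew that could make a single-digit split look cheaper than it should.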


> Kuromoji code donation - a new Japanese morphological analyzer
> --
>
> Key: LUCENE-3305
> URL: https://issues.apache.org/jira/browse/LUCENE-3305
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: modules/analysis
>Reporter: Christian Moen
>Assignee: Simon Willnauer
> Fix For: 4.0
>
> Attachments: Kuromoji short overview .pdf, LUCENE-3305.patch, 
> ip-clearance-Kuromoji.xml, ip-clearance-Kuromoji.xml, 
> kuromoji-0.7.6-asf.tar.gz, kuromoji-0.7.6.tar.gz, 
> kuromoji-solr-0.5.3-asf.tar.gz, kuromoji-solr-0.5.3.tar.gz, wordid0.patch



[jira] [Commented] (LUCENE-3305) Kuromoji code donation - a new Japanese morphological analyzer

2012-01-11 Thread Uwe Schindler (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183961#comment-13183961
 ] 

Uwe Schindler commented on LUCENE-3305:
---

Committed development branch revision: 1229948
Thanks Christian!

> Kuromoji code donation - a new Japanese morphological analyzer
> --
>
> Key: LUCENE-3305
> URL: https://issues.apache.org/jira/browse/LUCENE-3305
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: modules/analysis
>Reporter: Christian Moen
>Assignee: Simon Willnauer
> Fix For: 4.0
>
> Attachments: Kuromoji short overview .pdf, LUCENE-3305.patch, 
> ip-clearance-Kuromoji.xml, ip-clearance-Kuromoji.xml, 
> kuromoji-0.7.6-asf.tar.gz, kuromoji-0.7.6.tar.gz, 
> kuromoji-solr-0.5.3-asf.tar.gz, kuromoji-solr-0.5.3.tar.gz, wordid0.patch



[jira] [Commented] (LUCENE-3305) Kuromoji code donation - a new Japanese morphological analyzer

2012-01-11 Thread Uwe Schindler (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183958#comment-13183958
 ] 

Uwe Schindler commented on LUCENE-3305:
---

Hi Christian,
thanks for the fix. I will apply the patch to the branch. The tests 
testYabottai() and testTsukitosha() do no harm, but they have no meaning for 
our variant, because wordid=0 and the last wordid map to different words (we 
presort the whole dictionary for the FST). To make the tests really exercise 
wordid=0, I should look up the actual dictionary entries of the first and last words.
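
To illustrate why presorting changes which words sit at wordid=0 and at the last wordid, here is a minimal sketch. It assumes a word ID is simply an entry's position in the list, which is a simplification; the class `WordIdSketch` is hypothetical, and the romanized sample entries only borrow the test names as labels.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: a word ID is modeled as a position in the entry list.
final class WordIdSketch {

    // Returns the entry found at the given word ID, optionally after presorting.
    static String entryAt(List<String> entries, int wordId, boolean presort) {
        List<String> copy = new ArrayList<>(entries);
        if (presort) {
            Collections.sort(copy); // sorted input, as needed for building the FST
        }
        return copy.get(wordId);
    }

    public static void main(String[] args) {
        // Source order: first and last entries are the ones the tests are named after.
        List<String> source = Arrays.asList("yabottai", "aoi", "zurui", "tsukitosha");
        int last = source.size() - 1;

        System.out.println(entryAt(source, 0, false));    // prints "yabottai"
        System.out.println(entryAt(source, last, false)); // prints "tsukitosha"
        System.out.println(entryAt(source, 0, true));     // prints "aoi"
        System.out.println(entryAt(source, last, true));  // prints "zurui"
    }
}
```

After sorting, a test that still looks up the original first and last words no longer touches the boundary IDs, so it has to read the boundary entries from the built dictionary instead.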

> Kuromoji code donation - a new Japanese morphological analyzer
> --
>
> Key: LUCENE-3305
> URL: https://issues.apache.org/jira/browse/LUCENE-3305
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: modules/analysis
>Reporter: Christian Moen
>Assignee: Simon Willnauer
> Fix For: 4.0
>
> Attachments: Kuromoji short overview .pdf, LUCENE-3305.patch, 
> ip-clearance-Kuromoji.xml, ip-clearance-Kuromoji.xml, 
> kuromoji-0.7.6-asf.tar.gz, kuromoji-0.7.6.tar.gz, 
> kuromoji-solr-0.5.3-asf.tar.gz, kuromoji-solr-0.5.3.tar.gz, wordid0.patch
>
>



[jira] [Commented] (LUCENE-3305) Kuromoji code donation - a new Japanese morphological analyzer

2012-01-03 Thread Robert Muir (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13179101#comment-13179101
 ] 

Robert Muir commented on LUCENE-3305:
-

{quote}
I've built the branch. I needed to do ant test -Dargs="-Dfile.encoding=UTF-8" 
in order to make all the Kuromoji tests pass as some of them assume UTF-8 file 
encoding. (MacRoman is default on my system.)
{quote}

This sounds like a bug in the build; you shouldn't have to do that (it should 
be set already). However, my default encoding is UTF-8, so that's why I didn't 
catch it. I'll look into this.
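
The kind of failure discussed here can be reproduced in isolation. The following is a minimal sketch, not code from the Kuromoji tests: the class `EncodingSketch` is hypothetical, and ISO-8859-1 stands in for any non-UTF-8 platform default such as MacRoman.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: decoding with an explicit charset vs. a mismatched one.
final class EncodingSketch {

    // Decodes bytes with a named charset instead of relying on the platform default.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // Japanese for "airport", written with escapes so this file stays ASCII-only.
        String airport = "\u7a7a\u6e2f";
        byte[] utf8 = airport.getBytes(StandardCharsets.UTF_8);

        // Explicit UTF-8 round-trips regardless of -Dfile.encoding.
        System.out.println(decode(utf8, StandardCharsets.UTF_8).equals(airport));      // prints "true"
        // A single-byte charset turns the six UTF-8 bytes into six wrong characters.
        System.out.println(decode(utf8, StandardCharsets.ISO_8859_1).equals(airport)); // prints "false"
    }
}
```

Code that names its charset explicitly behaves the same on every platform, which is why tests written that way do not need the `-Dfile.encoding=UTF-8` workaround.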

> Kuromoji code donation - a new Japanese morphological analyzer
> --
>
> Key: LUCENE-3305
> URL: https://issues.apache.org/jira/browse/LUCENE-3305
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: modules/analysis
>Reporter: Christian Moen
>Assignee: Simon Willnauer
> Fix For: 4.0
>
> Attachments: Kuromoji short overview .pdf, LUCENE-3305.patch, 
> ip-clearance-Kuromoji.xml, ip-clearance-Kuromoji.xml, 
> kuromoji-0.7.6-asf.tar.gz, kuromoji-0.7.6.tar.gz, 
> kuromoji-solr-0.5.3-asf.tar.gz, kuromoji-solr-0.5.3.tar.gz
>
>



[jira] [Commented] (LUCENE-3305) Kuromoji code donation - a new Japanese morphological analyzer

2012-01-03 Thread Christian Moen (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13179100#comment-13179100
 ] 

Christian Moen commented on LUCENE-3305:


Thanks, Robert.

I've built the branch.  I needed to do {{ant test 
-Dargs="-Dfile.encoding=UTF-8"}} in order to make all the Kuromoji tests pass 
as some of them assume UTF-8 file encoding. (MacRoman is default on my system.)

I really appreciate the effort you and Simon have put in.  I also hope to 
make some meaningful contributions to make sure Kuromoji integrates and works 
well with Solr and Lucene.

> Kuromoji code donation - a new Japanese morphological analyzer
> --
>
> Key: LUCENE-3305
> URL: https://issues.apache.org/jira/browse/LUCENE-3305
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: modules/analysis
>Reporter: Christian Moen
>Assignee: Simon Willnauer
> Fix For: 4.0
>
> Attachments: Kuromoji short overview .pdf, LUCENE-3305.patch, 
> ip-clearance-Kuromoji.xml, ip-clearance-Kuromoji.xml, 
> kuromoji-0.7.6-asf.tar.gz, kuromoji-0.7.6.tar.gz, 
> kuromoji-solr-0.5.3-asf.tar.gz, kuromoji-solr-0.5.3.tar.gz
>
>



[jira] [Commented] (LUCENE-3305) Kuromoji code donation - a new Japanese morphological analyzer

2012-01-02 Thread Robert Muir (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13178635#comment-13178635
 ] 

Robert Muir commented on LUCENE-3305:
-

I created a branch here 
(https://svn.apache.org/repos/asf/lucene/dev/branches/lucene3305) 
with an initial import of this code and, so far, only minor tweaks to get things 
working in the build.



> Kuromoji code donation - a new Japanese morphological analyzer
> --
>
> Key: LUCENE-3305
> URL: https://issues.apache.org/jira/browse/LUCENE-3305
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: modules/analysis
>Reporter: Christian Moen
>Assignee: Simon Willnauer
> Fix For: 4.0
>
> Attachments: Kuromoji short overview .pdf, LUCENE-3305.patch, 
> ip-clearance-Kuromoji.xml, ip-clearance-Kuromoji.xml, 
> kuromoji-0.7.6-asf.tar.gz, kuromoji-0.7.6.tar.gz, 
> kuromoji-solr-0.5.3-asf.tar.gz, kuromoji-solr-0.5.3.tar.gz
>
>



[jira] [Commented] (LUCENE-3305) Kuromoji code donation - a new Japanese morphological analyzer

2011-11-14 Thread Christian Moen (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13149642#comment-13149642
 ] 

Christian Moen commented on LUCENE-3305:


Thanks a lot, Simon!

Robert, I agree completely with your comments.  The Unicode normalization is 
only done at dictionary build time.  Simon has turned it on by default -- its 
previous default was off.  Perhaps it makes sense to have it on in Lucene's 
case...
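
For readers unfamiliar with the Java 6 API involved, here is a minimal sketch of normalizing a dictionary entry at build time with `java.text.Normalizer`. The choice of NFKC is an assumption (the thread does not say which form is used), and the class `NormalizeSketch` and its method name are hypothetical.

```java
import java.text.Normalizer;

// Hypothetical sketch: normalizing a surface form before it enters the dictionary.
final class NormalizeSketch {

    // NFKC is assumed here; it folds full-width and half-width variants together.
    static String normalizeEntry(String surface) {
        return Normalizer.normalize(surface, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // Full-width Latin letters A, B, C (U+FF21..U+FF23).
        System.out.println(normalizeEntry("\uff21\uff22\uff23")); // prints "ABC"
        // Half-width katakana KA (U+FF76) becomes full-width KA (U+30AB).
        System.out.println(normalizeEntry("\uff76").equals("\u30ab")); // prints "true"
    }
}
```

If normalization is applied only when the dictionary is built, text seen at analysis time must be normalized separately for lookups to agree, which is one reason the default setting matters.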

Simon, the TokenizerRunner class doesn't seem to be included in the patch, 
which might be fine.  It's not strictly necessary for Lucene, but I think it's 
useful to keep it there so the analyzer can easily be run from the command 
line.  The DebugTokenizer and GraphvizFormatter are there already; they aren't 
strictly necessary either, but sometimes quite useful, so I think we should 
add the TokenizerRunner as well -- at least for now.

Tests didn't pass in my case, but I'll look more into this soon.  Tomorrow is 
very busy for me, but I'll have time for this on Wednesday.


> Kuromoji code donation - a new Japanese morphological analyzer
> --
>
> Key: LUCENE-3305
> URL: https://issues.apache.org/jira/browse/LUCENE-3305
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: modules/analysis
>Reporter: Christian Moen
>Assignee: Simon Willnauer
> Fix For: 4.0
>
> Attachments: Kuromoji short overview .pdf, LUCENE-3305.patch, 
> ip-clearance-Kuromoji.xml, ip-clearance-Kuromoji.xml, 
> kuromoji-0.7.6-asf.tar.gz, kuromoji-0.7.6.tar.gz, 
> kuromoji-solr-0.5.3-asf.tar.gz, kuromoji-solr-0.5.3.tar.gz
>
>



[jira] [Commented] (LUCENE-3305) Kuromoji code donation - a new Japanese morphological analyzer

2011-11-13 Thread Simon Willnauer (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13149276#comment-13149276
 ] 

Simon Willnauer commented on LUCENE-3305:
-

+1 to all your comments. For 3.x, let's figure this out somewhere else... first 
iterate on trunk, and when we have it at a reasonable stage we backport it to 
3.x. The vote succeeded, so we are good to go!

> Kuromoji code donation - a new Japanese morphological analyzer
> --
>
> Key: LUCENE-3305
> URL: https://issues.apache.org/jira/browse/LUCENE-3305
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: modules/analysis
>Reporter: Christian Moen
>Assignee: Simon Willnauer
> Attachments: Kuromoji short overview .pdf, LUCENE-3305.patch, 
> ip-clearance-Kuromoji.xml, ip-clearance-Kuromoji.xml, 
> kuromoji-0.7.6-asf.tar.gz, kuromoji-0.7.6.tar.gz, 
> kuromoji-solr-0.5.3-asf.tar.gz, kuromoji-solr-0.5.3.tar.gz
>
>



[jira] [Commented] (LUCENE-3305) Kuromoji code donation - a new Japanese morphological analyzer

2011-11-11 Thread Robert Muir (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13148933#comment-13148933
 ] 

Robert Muir commented on LUCENE-3305:
-

It looks like we want to add the Lucene analyzer/tokenizer and Solr factories 
from kuromoji-solr-0.5.3-asf.tar.gz.

I'd say once we get stuff going, maybe just download the dictionary, build it, 
and when committing, commit the built dictionary under the resources/ folder 
(this is where the script puts it).

I think for this kind of feature it might be hard to iterate with patches; we 
should maybe try to get it into SVN (trunk) initially and iterate with smaller 
issues. The code looks pretty clean to me already.

The produced jar file is somewhat large, but I think it's still reasonable, so 
I think we should look past this for now. From working with Sen before, I know 
some ways we can shrink this a lot, but that would be best on a future issue.

Some Java 6 APIs are used here (e.g. Unicode normalization). Christian, can you 
confirm this is only for the dictionary-build stage? It looked to me like it's 
only needed for ipadic/unidic parsing, but not for custom dictionary support.

If it's only for the build stage, personally I think that's fine for 3.x too, 
because I'm suggesting we commit a 'built' dictionary and tell people that if 
they want to compile the dictionary themselves they need Java 6. We could put 
the dictionary-building under a tools/ directory that's Java 6-only, or we 
could depend on ICU for just the tools/ piece (I think we already have such 
hacks for generating JFlex rules for StandardTokenizer) and be fine on Java 5.

+1 for the GraphVizFormatter... 


> Kuromoji code donation - a new Japanese morphological analyzer
> --
>
> Key: LUCENE-3305
> URL: https://issues.apache.org/jira/browse/LUCENE-3305
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: modules/analysis
>Reporter: Christian Moen
>Assignee: Simon Willnauer
> Attachments: Kuromoji short overview .pdf, LUCENE-3305.patch, 
> ip-clearance-Kuromoji.xml, ip-clearance-Kuromoji.xml, 
> kuromoji-0.7.6-asf.tar.gz, kuromoji-0.7.6.tar.gz, 
> kuromoji-solr-0.5.3-asf.tar.gz, kuromoji-solr-0.5.3.tar.gz
>
>



[jira] [Commented] (LUCENE-3305) Kuromoji code donation - a new Japanese morphological analyzer

2011-11-08 Thread Simon Willnauer (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13146591#comment-13146591
 ] 

Simon Willnauer commented on LUCENE-3305:
-

I sent the vote to general@incubator... we will see in 72h! Thanks, folks.

> Kuromoji code donation - a new Japanese morphological analyzer
> --
>
> Key: LUCENE-3305
> URL: https://issues.apache.org/jira/browse/LUCENE-3305
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: modules/analysis
>Reporter: Christian Moen
>Assignee: Simon Willnauer
> Attachments: Kuromoji short overview .pdf, ip-clearance-Kuromoji.xml, 
> ip-clearance-Kuromoji.xml, kuromoji-0.7.6-asf.tar.gz, kuromoji-0.7.6.tar.gz, 
> kuromoji-solr-0.5.3-asf.tar.gz, kuromoji-solr-0.5.3.tar.gz
>
>



[jira] [Commented] (LUCENE-3305) Kuromoji code donation - a new Japanese morphological analyzer

2011-11-08 Thread Simon Willnauer (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13146583#comment-13146583
 ] 

Simon Willnauer commented on LUCENE-3305:
-

I committed the file to the incubator ip-clearance in revision 1199470. I will 
go ahead and call an incubator vote now. Thanks, Grant.


> Kuromoji code donation - a new Japanese morphological analyzer
> --
>
> Key: LUCENE-3305
> URL: https://issues.apache.org/jira/browse/LUCENE-3305
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: modules/analysis
>Reporter: Christian Moen
>Assignee: Simon Willnauer
> Attachments: Kuromoji short overview .pdf, ip-clearance-Kuromoji.xml, 
> ip-clearance-Kuromoji.xml, kuromoji-0.7.6-asf.tar.gz, kuromoji-0.7.6.tar.gz, 
> kuromoji-solr-0.5.3-asf.tar.gz, kuromoji-solr-0.5.3.tar.gz
>
>



[jira] [Commented] (LUCENE-3305) Kuromoji code donation - a new Japanese morphological analyzer

2011-11-08 Thread Grant Ingersoll (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13146568#comment-13146568
 ] 

Grant Ingersoll commented on LUCENE-3305:
-

File looks good to me.  You need to check in the file to 
https://svn.apache.org/repos/asf/incubator/public/trunk/site-author/ip-clearance
 and then call a vote on gene...@incubator.apache.org (there should be examples 
of this in the archives for that list).  Vote is lazy consensus, so don't 
expect too much feedback.  Once that vote passes, then the code can be 
committed.

> Kuromoji code donation - a new Japanese morphological analyzer
> --
>
> Key: LUCENE-3305
> URL: https://issues.apache.org/jira/browse/LUCENE-3305
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: modules/analysis
>Reporter: Christian Moen
>Assignee: Simon Willnauer
> Attachments: Kuromoji short overview .pdf, ip-clearance-Kuromoji.xml, 
> ip-clearance-Kuromoji.xml, kuromoji-0.7.6-asf.tar.gz, kuromoji-0.7.6.tar.gz, 
> kuromoji-solr-0.5.3-asf.tar.gz, kuromoji-solr-0.5.3.tar.gz
>
>
> Atilika Inc. (アティリカ株式会社) would like to donate the Kuromoji Japanese 
> morphological analyzer to the Apache Software Foundation in the hope that it 
> will be useful to Lucene and Solr users in Japan and elsewhere.
> The project was started in 2010 since we couldn't find any high-quality, 
> actively maintained and easy-to-use Java-based Japanese morphological 
> analyzers, and these became many of our design goals for Kuromoji.
> Kuromoji also has a segmentation mode that is particularly useful for search, 
> which we hope will interest Lucene and Solr users.  Compound nouns, such as 
> 関西国際空港 (Kansai International Airport) and 日本経済新聞 (Nikkei Newspaper), are 
> segmented as one token with most analyzers.  As a result, a search for 空港 
> (airport) or 新聞 (newspaper) will not give you a hit for these words.  Kuromoji 
> can segment these words into 関西 国際 空港 and 日本 経済 新聞, which is generally what 
> you would want for search, so you'll get a hit.
> We also wanted to make sure the technology has a license that makes it 
> compatible with other Apache Software Foundation software to maximize its 
> usefulness.  Kuromoji has an Apache License 2.0 and all code is currently 
> owned by Atilika Inc.  The software has been developed by my good friend and 
> ex-colleague Masaru Hasegawa and myself.
> Kuromoji uses the so-called IPADIC for its dictionary/statistical model and 
> its license terms are described in NOTICE.txt.
> I'll upload code distributions and their corresponding hashes and I'd very 
> much like to start the code grant process.  I'm also happy to provide patches 
> to integrate Kuromoji into the codebase, if you prefer that.
> Please advise on how you'd like me to proceed with this.  Thank you.
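
The effect of the search-oriented segmentation described in the issue text can be illustrated with a toy inverted index. This is only a minimal sketch and not Kuromoji code: the SEGMENTS table is a hypothetical, hand-written stand-in for analyzer output, and tokenize and build_index are names invented for this illustration.

```python
from collections import defaultdict

# Hypothetical hand-written segmentations standing in for analyzer output.
SEGMENTS = {
    "関西国際空港": ["関西", "国際", "空港"],
    "日本経済新聞": ["日本", "経済", "新聞"],
}


def tokenize(text, decompound):
    """Return the compound as one token, or its parts when decompound is True."""
    if decompound and text in SEGMENTS:
        return SEGMENTS[text]
    return [text]


def build_index(docs, decompound):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text, decompound):
            index[token].add(doc_id)
    return index


docs = {1: "関西国際空港", 2: "日本経済新聞"}
whole = build_index(docs, decompound=False)  # compounds kept as single tokens
split = build_index(docs, decompound=True)   # compounds split into parts

print(sorted(whole.get("空港", set())))  # [] (no hit for "airport")
print(sorted(split.get("空港", set())))  # [1]
```

With compounds indexed as single tokens, the query 空港 matches nothing; once the compound is split into its parts, the same query finds document 1.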




[jira] [Commented] (LUCENE-3305) Kuromoji code donation - a new Japanese morphological analyzer

2011-11-03 Thread Robert Muir (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13143187#comment-13143187
 ] 

Robert Muir commented on LUCENE-3305:
-

Just a ping... what's our next step?




[jira] [Commented] (LUCENE-3305) Kuromoji code donation - a new Japanese morphological analyzer

2011-09-22 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13112410#comment-13112410
 ] 

Christian Moen commented on LUCENE-3305:


Thanks for the follow-up, Robert and Simon.  I've started working on a patch.




[jira] [Commented] (LUCENE-3305) Kuromoji code donation - a new Japanese morphological analyzer

2011-09-21 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13109356#comment-13109356
 ] 

Simon Willnauer commented on LUCENE-3305:
-

According to LEGAL-97, we can include the dict files. That means we can finish 
this code donation and get everything in shape for a commit. I will finish the 
paperwork once I am back from traveling.






[jira] [Commented] (LUCENE-3305) Kuromoji code donation - a new Japanese morphological analyzer

2011-09-20 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13109062#comment-13109062
 ] 

Robert Muir commented on LUCENE-3305:
-

Now that we have some feedback on LEGAL-97, what is the next step we need to 
take to move forward with this feature?




[jira] [Commented] (LUCENE-3305) Kuromoji code donation - a new Japanese morphological analyzer

2011-08-17 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13086186#comment-13086186
 ] 

Simon Willnauer commented on LUCENE-3305:
-

FYI - I created an issue with legal to categorize the IPADIC license: LEGAL-97




[jira] [Commented] (LUCENE-3305) Kuromoji code donation - a new Japanese morphological analyzer

2011-08-09 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13081540#comment-13081540
 ] 

Christian Moen commented on LUCENE-3305:


Correct.  You should definitely check this with legal.  I've tried to point 
this out in the description and in my email with the secretary as well.  If 
there are questions or concerns my legal counsel can possibly assist, but I 
guess this is something the ASF has to consider by itself.





[jira] [Commented] (LUCENE-3305) Kuromoji code donation - a new Japanese morphological analyzer

2011-08-09 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13081532#comment-13081532
 ] 

Simon Willnauer commented on LUCENE-3305:
-

> Please see NOTICE.txt for information on the dictionaries.

So those dictionaries are not ASL-licensed, right? I need to check with legal 
whether we can include them in our distribution at all, so we need to figure 
that out first.




[jira] [Commented] (LUCENE-3305) Kuromoji code donation - a new Japanese morphological analyzer

2011-08-09 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13081526#comment-13081526
 ] 

Christian Moen commented on LUCENE-3305:


Please see {{NOTICE.txt}} for information on the dictionaries.

Kindly let me know which files require a license header and how I should 
proceed to provide a revised version.  Do you prefer a complete tarball, or can 
I attach the files individually to this JIRA?

Thanks!





[jira] [Commented] (LUCENE-3305) Kuromoji code donation - a new Japanese morphological analyzer

2011-08-09 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13081522#comment-13081522
 ] 

Simon Willnauer commented on LUCENE-3305:
-

Christian, I see a couple of files in the resource folders that don't have a 
license header. We need to make sure that all files have an ASL 2 license 
header before we can finish the IP clearance process. Also, I don't know much 
about this segmenter, but I guess it works based on a dictionary, no? If so, 
where are the dictionary files? I only see resource files in the test folder, 
but maybe I am missing something.

simon




[jira] [Commented] (LUCENE-3305) Kuromoji code donation - a new Japanese morphological analyzer

2011-08-08 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13080812#comment-13080812
 ] 

Simon Willnauer commented on LUCENE-3305:
-

Christian, thanks for filing the paperwork. I just called a vote on 
dev@l.a.o; hope to get this done soon!

simon

> Kuromoji code donation - a new Japanese morphological analyzer
> --
>
> Key: LUCENE-3305
> URL: https://issues.apache.org/jira/browse/LUCENE-3305
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: modules/analysis
>Reporter: Christian Moen
>Assignee: Simon Willnauer
> Attachments: Kuromoji short overview .pdf, ip-clearance-Kuromoji.xml, 
> kuromoji-0.7.6-asf.tar.gz, kuromoji-0.7.6.tar.gz, 
> kuromoji-solr-0.5.3-asf.tar.gz, kuromoji-solr-0.5.3.tar.gz
>
>
> Atilika Inc. (アティリカ株式会社) would like to donate the Kuromoji Japanese 
> morphological analyzer to the Apache Software Foundation in the hope that it 
> will be useful to Lucene and Solr users in Japan and elsewhere.
> The project was started in 2010 since we couldn't find any high-quality, 
> actively maintained and easy-to-use Java-based Japanese morphological 
> analyzers, and these become many of our design goals for Kuromoji.
> Kuromoji also has a segmentation mode that is particularly useful for search, 
> which we hope will interest Lucene and Solr users.  Compound-nouns, such as 
> 関西国際空港 (Kansai International Airport) and 日本経済新聞 (Nikkei Newspaper), are 
> segmented as one token with most analyzers.  As a result, a search for 空港 
> (airport) or 新聞 (newspaper) will not give you a for in these words.  Kuromoji 
> can segment these words into 関西 国際 空港 and 日本 経済 新聞, which is generally what 
> you would want for search and you'll get a hit.
> We also wanted to make sure the technology has a license that makes it 
> compatible with other Apache Software Foundation software to maximize its 
> usefulness.  Kuromoji has an Apache License 2.0 and all code is currently 
> owned by Atilika Inc.  The software has been developed by my good friend and 
> ex-colleague Masaru Hasegawa and myself.
> Kuromoji uses the so-called IPADIC for its dictionary/statistical model and 
> its license terms are described in NOTICE.txt.
> I'll upload code distributions and their corresponding hashes and I'd very 
> much like to start the code grant process.  I'm also happy to provide patches 
> to integrate Kuromoji into the codebase, if you prefer that.
> Please advise on how you'd like me to proceed with this.  Thank you.
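
To make the search-mode argument in the description concrete, here is a minimal sketch of why splitting compounds matters for retrieval. It does not run Kuromoji: the two segmentations are hard-coded from the examples in the description, and the toy index and the names build_index and search are illustrative assumptions, not part of any Lucene or Kuromoji API.

```python
from collections import defaultdict

# Hard-coded segmentations taken from the issue description; a real
# analyzer would compute these from its dictionary and statistical model.
SINGLE_TOKEN = {
    "関西国際空港": ["関西国際空港"],
    "日本経済新聞": ["日本経済新聞"],
}
SEARCH_MODE = {
    "関西国際空港": ["関西", "国際", "空港"],
    "日本経済新聞": ["日本", "経済", "新聞"],
}


def build_index(segmentations):
    """Map each token to the set of documents (here: the compounds) containing it."""
    index = defaultdict(set)
    for doc, tokens in segmentations.items():
        for token in tokens:
            index[token].add(doc)
    return index


def search(index, term):
    """Return the documents that were indexed with exactly this token."""
    return sorted(index.get(term, set()))


single_index = build_index(SINGLE_TOKEN)
search_index = build_index(SEARCH_MODE)

print(search(single_index, "空港"))  # compound kept whole: no hit
print(search(search_index, "空港"))  # compound split: the airport is found
print(search(search_index, "新聞"))  # compound split: the newspaper is found
```

With the compound indexed as one token, a query for 空港 matches nothing; with the split tokens, the same query finds 関西国際空港.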

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3305) Kuromoji code donation - a new Japanese morphological analyzer

2011-08-01 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13076069#comment-13076069
 ] 

Christian Moen commented on LUCENE-3305:


Hello again, Simon.  I've filed the paperwork and copied you on email.  Hope 
you're enjoying your vacation!

> Kuromoji code donation - a new Japanese morphological analyzer
> --
>
> Key: LUCENE-3305
> URL: https://issues.apache.org/jira/browse/LUCENE-3305
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: modules/analysis
>Reporter: Christian Moen
>Assignee: Simon Willnauer
> Attachments: Kuromoji short overview .pdf, ip-clearance-Kuromoji.xml, 
> kuromoji-0.7.6-asf.tar.gz, kuromoji-0.7.6.tar.gz, 
> kuromoji-solr-0.5.3-asf.tar.gz, kuromoji-solr-0.5.3.tar.gz



[jira] [Commented] (LUCENE-3305) Kuromoji code donation - a new Japanese morphological analyzer

2011-07-22 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13069481#comment-13069481
 ] 

Christian Moen commented on LUCENE-3305:


Hello Simon.  I'll file the paperwork over the next couple of days by email and 
copy you.  Have a brilliant vacation! :)

> Kuromoji code donation - a new Japanese morphological analyzer
> --
>
> Key: LUCENE-3305
> URL: https://issues.apache.org/jira/browse/LUCENE-3305
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: modules/analysis
>Reporter: Christian Moen
>Assignee: Simon Willnauer
> Attachments: Kuromoji short overview .pdf, ip-clearance-Kuromoji.xml, 
> kuromoji-0.7.6-asf.tar.gz, kuromoji-0.7.6.tar.gz, 
> kuromoji-solr-0.5.3-asf.tar.gz, kuromoji-solr-0.5.3.tar.gz



[jira] [Commented] (LUCENE-3305) Kuromoji code donation - a new Japanese morphological analyzer

2011-07-22 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13069472#comment-13069472
 ] 

Simon Willnauer commented on LUCENE-3305:
-

I am going to be away for 2 weeks. If somebody wants to continue driving this 
code grant, please do. Otherwise, @christian, sorry for the break; I will 
continue once I am back, or here and there if I find a computer :)

simon

> Kuromoji code donation - a new Japanese morphological analyzer
> --
>
> Key: LUCENE-3305
> URL: https://issues.apache.org/jira/browse/LUCENE-3305
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: modules/analysis
>Reporter: Christian Moen
>Assignee: Simon Willnauer
> Attachments: Kuromoji short overview .pdf, ip-clearance-Kuromoji.xml, 
> kuromoji-0.7.6-asf.tar.gz, kuromoji-0.7.6.tar.gz, 
> kuromoji-solr-0.5.3-asf.tar.gz, kuromoji-solr-0.5.3.tar.gz



[jira] [Commented] (LUCENE-3305) Kuromoji code donation - a new Japanese morphological analyzer

2011-07-20 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13068814#comment-13068814
 ] 

Simon Willnauer commented on LUCENE-3305:
-

Christian, apparently we just handle this the same way as the CLA. You fill it 
out, scan it and send it to secret...@apache.org. Make sure you use the ICLA 
details when you file it.

Let me know once those are sent.

> Kuromoji code donation - a new Japanese morphological analyzer
> --
>
> Key: LUCENE-3305
> URL: https://issues.apache.org/jira/browse/LUCENE-3305
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: modules/analysis
>Reporter: Christian Moen
>Assignee: Simon Willnauer
> Attachments: Kuromoji short overview .pdf, ip-clearance-Kuromoji.xml, 
> kuromoji-0.7.6-asf.tar.gz, kuromoji-0.7.6.tar.gz, 
> kuromoji-solr-0.5.3-asf.tar.gz, kuromoji-solr-0.5.3.tar.gz



[jira] [Commented] (LUCENE-3305) Kuromoji code donation - a new Japanese morphological analyzer

2011-07-20 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13068766#comment-13068766
 ] 

Christian Moen commented on LUCENE-3305:


Hello again, Simon. Has there been any update as to where I should send the 
code grant? Many thanks.

> Kuromoji code donation - a new Japanese morphological analyzer
> --
>
> Key: LUCENE-3305
> URL: https://issues.apache.org/jira/browse/LUCENE-3305
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: modules/analysis
>Reporter: Christian Moen
>Assignee: Simon Willnauer
> Attachments: Kuromoji short overview .pdf, ip-clearance-Kuromoji.xml, 
> kuromoji-0.7.6-asf.tar.gz, kuromoji-0.7.6.tar.gz, 
> kuromoji-solr-0.5.3-asf.tar.gz, kuromoji-solr-0.5.3.tar.gz



[jira] [Commented] (LUCENE-3305) Kuromoji code donation - a new Japanese morphological analyzer

2011-07-17 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13066786#comment-13066786
 ] 

Christian Moen commented on LUCENE-3305:


Thanks, Simon.  Please let me know where I should send the code grant and I'll 
file the paperwork.

> Kuromoji code donation - a new Japanese morphological analyzer
> --
>
> Key: LUCENE-3305
> URL: https://issues.apache.org/jira/browse/LUCENE-3305
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: modules/analysis
>Reporter: Christian Moen
>Assignee: Simon Willnauer
> Attachments: Kuromoji short overview .pdf, ip-clearance-Kuromoji.xml, 
> kuromoji-0.7.6-asf.tar.gz, kuromoji-0.7.6.tar.gz, 
> kuromoji-solr-0.5.3-asf.tar.gz, kuromoji-solr-0.5.3.tar.gz



[jira] [Commented] (LUCENE-3305) Kuromoji code donation - a new Japanese morphological analyzer

2011-07-15 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13066210#comment-13066210
 ] 

Simon Willnauer commented on LUCENE-3305:
-

Koji, I took the issue until the code grant is done, etc.

> Kuromoji code donation - a new Japanese morphological analyzer
> --
>
> Key: LUCENE-3305
> URL: https://issues.apache.org/jira/browse/LUCENE-3305
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: modules/analysis
>Reporter: Christian Moen
>Assignee: Simon Willnauer
> Attachments: Kuromoji short overview .pdf, ip-clearance-Kuromoji.xml, 
> kuromoji-0.7.6-asf.tar.gz, kuromoji-0.7.6.tar.gz, 
> kuromoji-solr-0.5.3-asf.tar.gz, kuromoji-solr-0.5.3.tar.gz



[jira] [Commented] (LUCENE-3305) Kuromoji code donation - a new Japanese morphological analyzer

2011-07-13 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13064422#comment-13064422
 ] 

Christian Moen commented on LUCENE-3305:


Please let me know if you need paperwork from me to follow up on this.  Thanks 
again.

> Kuromoji code donation - a new Japanese morphological analyzer
> --
>
> Key: LUCENE-3305
> URL: https://issues.apache.org/jira/browse/LUCENE-3305
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: modules/analysis
>Reporter: Christian Moen
>Assignee: Koji Sekiguchi
> Attachments: Kuromoji short overview .pdf, kuromoji-0.7.6-asf.tar.gz, 
> kuromoji-0.7.6.tar.gz, kuromoji-solr-0.5.3-asf.tar.gz, 
> kuromoji-solr-0.5.3.tar.gz



[jira] [Commented] (LUCENE-3305) Kuromoji code donation - a new Japanese morphological analyzer

2011-07-12 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063955#comment-13063955
 ] 

Christian Moen commented on LUCENE-3305:


It's been a while, hasn't it? Thanks a lot, Koji. :)

I completely agree.  If we can get Kuromoji into the codebase, I'm more than 
happy to submit patches for your filters so that they will work with Kuromoji.

Kuromoji has preliminary support for UniDic and it sounds like a good idea to 
join effort on this as well.  We could support them all; IPADIC, NAIST JDIC and 
UniDic.


> Kuromoji code donation - a new Japanese morphological analyzer
> --
>
> Key: LUCENE-3305
> URL: https://issues.apache.org/jira/browse/LUCENE-3305
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: modules/analysis
>Reporter: Christian Moen
>Assignee: Koji Sekiguchi
> Attachments: Kuromoji short overview .pdf, kuromoji-0.7.6-asf.tar.gz, 
> kuromoji-0.7.6.tar.gz, kuromoji-solr-0.5.3-asf.tar.gz, 
> kuromoji-solr-0.5.3.tar.gz



[jira] [Commented] (LUCENE-3305) Kuromoji code donation - a new Japanese morphological analyzer

2011-07-12 Thread Koji Sekiguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063942#comment-13063942
 ] 

Koji Sekiguchi commented on LUCENE-3305:


Hi Christian, it's been a long time. Contributing Kuromoji to Lucene/Solr 
sounds really nice! As Uwe already mentioned, lucene-gosen has really good 
TokenFilters, which are in org.apache packages and under the Apache License. 
It would be nice if this Japanese tokenizer used them. Plus, lucene-gosen can 
use not only IPADIC but also NAIST JDIC. I'd like the tokenizer to be able to 
choose its dictionary in a future release.

> Kuromoji code donation - a new Japanese morphological analyzer
> --
>
> Key: LUCENE-3305
> URL: https://issues.apache.org/jira/browse/LUCENE-3305
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: modules/analysis
>Reporter: Christian Moen
>Assignee: Koji Sekiguchi
> Attachments: Kuromoji short overview .pdf, kuromoji-0.7.6-asf.tar.gz, 
> kuromoji-0.7.6.tar.gz, kuromoji-solr-0.5.3-asf.tar.gz, 
> kuromoji-solr-0.5.3.tar.gz



[jira] [Commented] (LUCENE-3305) Kuromoji code donation - a new Japanese morphological analyzer

2011-07-12 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063928#comment-13063928
 ] 

Christian Moen commented on LUCENE-3305:


Thanks, Uwe!

I think we definitely should work together and combine the great work that 
Robert, Koji & co. have been doing on Lucene-GoSen with Kuromoji to make a 
highly attractive Japanese linguistics offering that is also an integrated part 
of Lucene/Solr.

The attributes do indeed look very nice -- excellent job!  I have several 
improvements in mind for Kuromoji (and other Japanese related code) and I'm 
looking forward to working with you to improve some of these things.

In addition to its license, an issue with GoSen (and Sen) used to be its 
segmentation quality.  To my knowledge, these analyzers still don't support 
so-called "unknown words", which means that words that are not in the 
dictionaries get second-rate treatment, and this hurts segmentation quality.





> Kuromoji code donation - a new Japanese morphological analyzer
> --
>
> Key: LUCENE-3305
> URL: https://issues.apache.org/jira/browse/LUCENE-3305
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: modules/analysis
>Reporter: Christian Moen
> Attachments: Kuromoji short overview .pdf, kuromoji-0.7.6-asf.tar.gz, 
> kuromoji-0.7.6.tar.gz, kuromoji-solr-0.5.3-asf.tar.gz, 
> kuromoji-solr-0.5.3.tar.gz
>
>
> Atilika Inc. (アティリカ株式会社) would like to donate the Kuromoji Japanese 
> morphological analyzer to the Apache Software Foundation in the hope that it 
> will be useful to Lucene and Solr users in Japan and elsewhere.
> The project was started in 2010 because we couldn't find any high-quality, 
> actively maintained and easy-to-use Java-based Japanese morphological 
> analyzers, and these became many of our design goals for Kuromoji.
> Kuromoji also has a segmentation mode that is particularly useful for search, 
> which we hope will interest Lucene and Solr users.  Compound nouns, such as 
> 関西国際空港 (Kansai International Airport) and 日本経済新聞 (Nikkei Newspaper), are 
> segmented as one token with most analyzers.  As a result, a search for 空港 
> (airport) or 新聞 (newspaper) will not give you a hit for these words.  Kuromoji 
> can segment these words into 関西 国際 空港 and 日本 経済 新聞, which is generally what 
> you would want for search, so you'll get a hit.
> We also wanted to make sure the technology has a license that makes it 
> compatible with other Apache Software Foundation software to maximize its 
> usefulness.  Kuromoji has an Apache License 2.0 and all code is currently 
> owned by Atilika Inc.  The software has been developed by my good friend and 
> ex-colleague Masaru Hasegawa and myself.
> Kuromoji uses the so-called IPADIC for its dictionary/statistical model and 
> its license terms are described in NOTICE.txt.
> I'll upload code distributions and their corresponding hashes and I'd very 
> much like to start the code grant process.  I'm also happy to provide patches 
> to integrate Kuromoji into the codebase, if you prefer that.
> Please advise on how you'd like me to proceed with this.  Thank you.
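
The effect of the search-oriented segmentation described in the issue can be illustrated with a toy exact-match index. This is only a sketch: the token lists are written out by hand from the examples in the description, no real analyzer (Kuromoji or otherwise) is run, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class CompoundSearchSketch {

    /** Builds a toy inverted index: token -> ids of the documents containing that exact token. */
    public static Map<String, Set<Integer>> index(List<List<String>> docs) {
        Map<String, Set<Integer>> idx = new HashMap<>();
        for (int id = 0; id < docs.size(); id++) {
            for (String token : docs.get(id)) {
                idx.computeIfAbsent(token, k -> new TreeSet<>()).add(id);
            }
        }
        return idx;
    }

    /** Exact-match lookup of a single query term; returns an empty set when nothing matches. */
    public static Set<Integer> search(Map<String, Set<Integer>> idx, String term) {
        return idx.getOrDefault(term, Collections.emptySet());
    }

    public static void main(String[] args) {
        // Hand-written token lists (assumption): doc 0 is the airport, doc 1 the newspaper.
        List<List<String>> wholeCompounds = Arrays.asList(
                Arrays.asList("関西国際空港"),
                Arrays.asList("日本経済新聞"));
        List<List<String>> segmented = Arrays.asList(
                Arrays.asList("関西", "国際", "空港"),
                Arrays.asList("日本", "経済", "新聞"));

        // One token per compound: the query term never equals an indexed token.
        System.out.println("whole compounds: " + search(index(wholeCompounds), "空港"));
        // Compound split into its parts: the query term matches an indexed token.
        System.out.println("segmented:       " + search(index(segmented), "空港"));
    }
}
```

With one token per compound the lookup for 空港 comes back empty, while the segmented token lists let the same lookup find the airport document.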

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3305) Kuromoji code donation - a new Japanese morphological analyzer

2011-07-12 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063910#comment-13063910
 ] 

Uwe Schindler commented on LUCENE-3305:
---

Code looks cool. I think we should first do the legal stuff and then produce 
patches. Robert is currently developing another morphological analyzer 
(Lucene-Gosen, https://code.google.com/p/lucene-gosen/), but this one uses an 
LGPL library that cannot be included with Lucene/Solr. The Lucene part has lots 
of cool attributes and additional TokenFilters, so maybe we combine 
lucene-gosen with this one (your Apache-2.0 and his TokenFilters+Attributes)? 
That would be really cool.




[jira] [Commented] (LUCENE-3305) Kuromoji code donation - a new Japanese morphological analyzer

2011-07-12 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063885#comment-13063885
 ] 

Christian Moen commented on LUCENE-3305:


Thanks, Robert and Mark.

I'll upload new tarballs in which all Java source files use the standard ASF 
license notice.  I've also removed author tags to comply better with code 
standards, and I've removed any Atilika Inc. copyrights from NOTICE.txt in 
both tarballs.
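
A header check of this kind could be scripted roughly as follows. This is a minimal sketch, not part of the attached tarballs: the class name, the method names, and the 2048-character window are assumptions, and it only looks for the opening words of the standard ASF source header. It does not detect files that carry an extra copyright line inside the header.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class AsfHeaderCheck {

    // Opening words of the standard ASF source header.
    static final String MARKER = "Licensed to the Apache Software Foundation (ASF)";

    // Only the top of each file is inspected (window size is an arbitrary assumption).
    static final int HEAD_CHARS = 2048;

    /** Returns true if the marker text appears near the top of the given source text. */
    public static boolean hasAsfHeader(String source) {
        String head = source.substring(0, Math.min(source.length(), HEAD_CHARS));
        return head.contains(MARKER);
    }

    /** Walks the tree under root and lists the .java files whose top lacks the marker. */
    public static List<Path> javaFilesMissingHeader(Path root) throws IOException {
        List<Path> javaFiles;
        try (Stream<Path> paths = Files.walk(root)) {
            javaFiles = paths
                    .filter(Files::isRegularFile)
                    .filter(p -> p.toString().endsWith(".java"))
                    .sorted()
                    .collect(Collectors.toList());
        }
        List<Path> missing = new ArrayList<>();
        for (Path p : javaFiles) {
            String text = new String(Files.readAllBytes(p), StandardCharsets.UTF_8);
            if (!hasAsfHeader(text)) {
                missing.add(p);
            }
        }
        return missing;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java AsfHeaderCheck <source-root>   (defaults to the current directory)
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        for (Path p : javaFilesMissingHeader(root)) {
            System.out.println("missing ASF header: " + p);
        }
    }
}
```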




[jira] [Commented] (LUCENE-3305) Kuromoji code donation - a new Japanese morphological analyzer

2011-07-12 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063846#comment-13063846
 ] 

Mark Miller commented on LUCENE-3305:
-

bq. But these things are separate, right?

Right - looks like all we need is the ASF copyright in the files. The rest can 
easily be handled after the grant goes through.




[jira] [Commented] (LUCENE-3305) Kuromoji code donation - a new Japanese morphological analyzer

2011-07-12 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063836#comment-13063836
 ] 

Robert Muir commented on LUCENE-3305:
-

{quote}
I looked briefly at the sources here and I think we need to put this into a 
patch rather than into a tar.gz. Some of the files don't have an apache header and 
some of the files state a copyright in the ASL 2 header. Basically for the code 
grant you need to put "our" ASL header into each file.
{quote}

But these things are separate, right? Can't he just fix the license headers and 
upload a new .tar.gz?

I don't see anywhere that says a code grant should be a patch. Requiring one 
puts the burden on Christian to do all the work, and our trunk moves too fast. 
Let's defer creating a patch until the code grant stuff is over; anyone could 
then turn it into a patch.





[jira] [Commented] (LUCENE-3305) Kuromoji code donation - a new Japanese morphological analyzer

2011-07-12 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063814#comment-13063814
 ] 

Christian Moen commented on LUCENE-3305:


Thanks a lot, Simon.  I wasn't sure when we'd update the headers as part of the 
process, so thanks for clarifying that, too.

Kuromoji downloads IPADIC as part of its build (from our server in Japan) to 
make its data structures, which it bundles into its jar file (the jar becomes 
11M, but can be made a lot smaller).  Building also requires more than the 
default heap space, so its build is a little convoluted and different from the 
other code in /modules/analysis/common.

Kuromoji is also usable independently from search, even though search is 
perhaps its most important application.  Would it be a good idea for me to make 
a patch that puts it in /modules/analysis/kuromoji for now and that we take 
things from there?

The quickest way to get Kuromoji in there would be to check the jar file into 
/modules/analysis/kuromoji/lib, but I'm not sure that's a good way to go.

I'll follow up in whatever way you prefer.  Thanks again! :)







[jira] [Commented] (LUCENE-3305) Kuromoji code donation - a new Japanese morphological analyzer

2011-07-12 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063783#comment-13063783
 ] 

Simon Willnauer commented on LUCENE-3305:
-

WOW this is awesome. It seems we need to file some IP clearance here since this 
is a substantial contribution not developed in the ASF source control or on the 
mailing list. I will figure out the process here. 

I looked briefly at the sources here and I think we need to put this into a 
patch rather than into a tar.gz. Some of the files don't have an Apache header 
and some of the files state a copyright in the ASL 2 header. Basically for the 
code grant you need to put "our" ASL header into each file.
We also need to apply these sources to our source tree, so it is very likely 
that this goes under /modules/analysis/common. Can you try to create a patch 
against trunk? If it is too much of a hassle you can also move the Solr 
integration to a different issue. 

thanks simon




[jira] [Commented] (LUCENE-3305) Kuromoji code donation - a new Japanese morphological analyzer

2011-07-12 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063756#comment-13063756
 ] 

Christian Moen commented on LUCENE-3305:


MD5 hashes for the attachments are as follows:
{code}
MD5 (kuromoji-0.7.6.tar.gz) = 70d3d2f69f0511b86ebe11484cbe1313
MD5 (kuromoji-solr-0.5.3.tar.gz) = b9a54698c9aebc264845e64d3904642d
{code}
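
To compare a downloaded attachment against the hashes above, something along these lines should work. It is a minimal sketch that uses only the JDK's MessageDigest; the class name Md5Check and its command-line arguments are hypothetical and not part of Kuromoji.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Md5Check {

    /** Streams the file through MD5 and returns the digest as lowercase hex. */
    public static String md5Hex(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        // Usage: java Md5Check <file> <expected-md5-hex>
        if (args.length != 2) {
            System.err.println("usage: java Md5Check <file> <expected-md5-hex>");
            return;
        }
        String actual = md5Hex(Paths.get(args[0]));
        boolean matches = actual.equalsIgnoreCase(args[1]);
        System.out.println((matches ? "OK       " : "MISMATCH ") + args[0] + " = " + actual);
    }
}
```

For example, running it with kuromoji-0.7.6.tar.gz and the first hash listed above as arguments reports whether the local copy matches.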
