[jira] [Commented] (LUCENE-8959) JapaneseNumberFilter does not take whitespaces into account when concatenating numbers

2019-08-29 Thread Christian Moen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16918464#comment-16918464
 ] 

Christian Moen commented on LUCENE-8959:


Sounds like a good idea.  This is also a rather big rabbit hole... 

Would it be useful to consider making the digit grouping separators 
configurable as part of a bigger scheme here?

In Japanese, if you're processing text with SI-style numbers, I believe a space 
is a valid digit grouping separator.

> JapaneseNumberFilter does not take whitespaces into account when 
> concatenating numbers
> --
>
> Key: LUCENE-8959
> URL: https://issues.apache.org/jira/browse/LUCENE-8959
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
>Priority: Minor
>
> Today the JapaneseNumberFilter tries to concatenate numbers even if they are 
> separated by whitespace. So for instance "10 100" is rewritten into "10100" 
> even if the tokenizer doesn't discard punctuation. In practice this is not 
> an issue, but it can lead to giant tokens if there are a lot of numbers 
> separated by spaces. The number of concatenations should be configurable 
> with a sane default limit in order to avoid creating big tokens that slow 
> down the analysis.
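
For reference, a minimal sketch that reproduces the concatenation across 
whitespace described above (this is not from the issue; it only uses the 
existing JapaneseTokenizer and JapaneseNumberFilter APIs):

{noformat}
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.ja.JapaneseNumberFilter;
import org.apache.lucene.analysis.ja.JapaneseTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class NumberConcatenationDemo {
  public static void main(String[] args) throws Exception {
    // Keep punctuation (discardPunctuation = false) and use no user dictionary.
    Tokenizer tokenizer = new JapaneseTokenizer(null, false, JapaneseTokenizer.Mode.SEARCH);
    tokenizer.setReader(new StringReader("10 100"));
    try (TokenStream stream = new JapaneseNumberFilter(tokenizer)) {
      CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
      stream.reset();
      while (stream.incrementToken()) {
        // Per the issue description, "10" and "100" are emitted joined as "10100".
        System.out.println(term);
      }
      stream.end();
    }
  }
}
{noformat}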



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8817) Combine Nori and Kuromoji DictionaryBuilder

2019-06-09 Thread Christian Moen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16859673#comment-16859673
 ] 

Christian Moen commented on LUCENE-8817:


Thanks, [~tomoko].  I don't think we should use "mecab" in the naming.  Please 
let me elaborate a bit.

Kuromoji can read MeCab-format models, but Kuromoji isn't a port of MeCab.  
Kuromoji has been developed independently, without inspecting or reviewing any 
MeCab source code.  This was an initial goal of the project, to make sure we 
could use the Apache License.

The MeCab and Kuromoji feature sets are quite different and I think users will 
find it confusing if they expect MeCab and find that Kuromoji is much more 
limited.

I'm also unsure whether Kudo-san would appreciate us making an association by 
name like this.  In my opinion, it certainly doesn't give due credit to MeCab, 
which is a much more extensive project.

In terms of naming, what about using "statistical" instead of "mecab" for this 
class of analyzers?

I'm thinking "Viterbi" could be good to refer to in shared tokenizer code.

That said, I think it could be good to refer to "mecab" in the dictionary 
compiler code, documentation, etc. to make sure users understand that we can 
read this model format.

Any thoughts?

> Combine Nori and Kuromoji DictionaryBuilder
> ---
>
> Key: LUCENE-8817
> URL: https://issues.apache.org/jira/browse/LUCENE-8817
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Namgyu Kim
>Priority: Major
>
> This issue is related to LUCENE-8816.
> Currently the Nori and Kuromoji analyzers use the same dictionary structure 
> (MeCab).
>  If we combine the DictionaryBuilders, we can reduce the code size.
>  But this task may have language-dependent parts
>  (like the HEADER string in BinaryDictionary and CharacterDefinition, and methods in 
> BinaryDictionaryWriter, ...).
>  On the other hand, there are many overlapping classes.
> The purpose of this patch is to provide users of Nori and Kuromoji with the 
> same system dictionary generator.
> It may take some time because there is a fair amount of work involved.
>  The work will be based on the latest master, and if LUCENE-8816 is 
> finished first, I will pull the latest code and proceed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8816) Decouple Kuromoji's morphological analyser and its dictionary

2019-05-30 Thread Christian Moen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16852551#comment-16852551
 ] 

Christian Moen commented on LUCENE-8816:


Separating out the dictionaries is a great idea.

[~rcmuir] made great efforts to make the original dictionary tiny, and some 
assumptions were made based on the value ranges of the original source data.

To me it sounds like a good idea to keep the Japanese and Korean dictionaries 
separate initially and consider combining them later on, when the implications of 
such a combination are clear.  I agree with [~jim.ferenczi].

> Decouple Kuromoji's morphological analyser and its dictionary
> -
>
> Key: LUCENE-8816
> URL: https://issues.apache.org/jira/browse/LUCENE-8816
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Tomoko Uchida
>Priority: Major
>
> I was inspired by this mailing-list thread:
>  
> [http://mail-archives.apache.org/mod_mbox/lucene-java-user/201905.mbox/%3CCAGUSZHA3U_vWpRfxQb4jttT7sAOu%2BuaU8MfvXSYgNP9s9JNsXw%40mail.gmail.com%3E]
> As many Japanese users already know, the default built-in dictionary bundled with 
> Kuromoji (MeCab IPADIC) is a bit old and has not been maintained for many years. 
> While it has slowly become obsolete, well-maintained and/or extended 
> dictionaries have risen up in recent years (e.g. 
> [mecab-ipadic-neologd|https://github.com/neologd/mecab-ipadic-neologd], 
> [UniDic|https://unidic.ninjal.ac.jp/]). Some attempts/projects/efforts have been 
> made in Japan to use them with Kuromoji.
> However, the current architecture - a dictionary bundled into the jar - is essentially 
> incompatible with the idea of switching the system dictionary, and developers 
> have difficulty doing so.
> Traditionally, the morphological analysis engine (Viterbi logic) and the 
> encoded dictionary (language model) have been decoupled (as in MeCab, the 
> origin of Kuromoji, or lucene-gosen). So decoupling them is actually a 
> natural idea, and I feel that it's a good time to re-think the current 
> architecture.
> This would also be good for advanced users who have customized/re-trained 
> their own system dictionary.
> Goals of this issue:
>  * Decouple JapaneseTokenizer itself from the encoded system dictionary.
>  * Implement a dynamic dictionary load mechanism.
>  * Provide a developer-oriented dictionary build tool.
> Non-goals:
>   * Provide a learner or language model (it's up to users and should be outside 
> the scope).
> I have not dived into the code yet, so I have no idea at this moment whether 
> this will be easy or difficult.
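
To make the "decouple" goal above concrete, here is a purely hypothetical sketch of 
the kind of API shape it implies; {{SystemDictionary}} and its {{load()}} method are 
invented names for illustration only, not an existing Lucene API:

{noformat}
// Hypothetical sketch - none of these dictionary-loading names exist today.
SystemDictionary systemDict = SystemDictionary.load(Paths.get("/path/to/compiled-unidic"));
JapaneseTokenizer tokenizer =
    new JapaneseTokenizer(systemDict,      // dynamically loaded system dictionary
                          userDictionary,  // optional user dictionary, as today
                          true,            // discardPunctuation
                          JapaneseTokenizer.Mode.SEARCH);
{noformat}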



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8752) Apply a patch to kuromoji dictionary to properly handle Japanese new era '令和' (REIWA)

2019-04-10 Thread Christian Moen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16814467#comment-16814467
 ] 

Christian Moen commented on LUCENE-8752:


Thanks a lot, [~Tomoko Uchida].

> Apply a patch to kuromoji dictionary to properly handle Japanese new era '令和' 
> (REIWA)
> -
>
> Key: LUCENE-8752
> URL: https://issues.apache.org/jira/browse/LUCENE-8752
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Minor
>
> As of May 1st, 2019, the Japanese era '元号' (Gengo) will be set to '令和' (Reiwa). 
> See this article for more details:
> [https://www.bbc.com/news/world-asia-47769566]
> Currently '令和' is split up into '令' and '和' by {{JapaneseTokenizer}}. It 
> should be tokenized as one word so that Japanese texts including era names 
> are searched as users expect. Because the default Kuromoji dictionary 
> (mecab-ipadic) has not been maintained since 2007, a one-line patch to the 
> source CSV file is needed for this era change.
> Era names are used in many official or formal documents in Japan, so it would 
> be desirable that search systems properly handle this without adding a user 
> dictionary or using a phrase query. :)
> FYI, the JDK DateTime API will support the new era (in upcoming updates).
> [https://blogs.oracle.com/java-platform-group/a-new-japanese-era-for-java]
> The patch is available here:
> [https://github.com/apache/lucene-solr/pull/632]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8752) Apply a patch to kuromoji dictionary to properly handle Japanese new era '令和' (REIWA)

2019-04-05 Thread Christian Moen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16811412#comment-16811412
 ] 

Christian Moen commented on LUCENE-8752:


Thanks for this, [~Tomoko Uchida].  I think it's a good idea to make this 
change.  I'll follow up early next week.

> Apply a patch to kuromoji dictionary to properly handle Japanese new era '令和' 
> (REIWA)
> -
>
> Key: LUCENE-8752
> URL: https://issues.apache.org/jira/browse/LUCENE-8752
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Tomoko Uchida
>Priority: Minor
>
> As of May 1st, 2019, the Japanese era '元号' (Gengo) will be set to '令和' (Reiwa). 
> See this article for more details:
> [https://www.bbc.com/news/world-asia-47769566]
> Currently '令和' is split up into '令' and '和' by {{JapaneseTokenizer}}. It 
> should be tokenized as one word so that Japanese texts including era names 
> are searched as users expect. Because the default Kuromoji dictionary 
> (mecab-ipadic) has not been maintained since 2007, a one-line patch to the 
> source CSV file is needed for this era change.
> Era names are used in many official or formal documents in Japan, so it would 
> be desirable that search systems properly handle this without adding a user 
> dictionary or using a phrase query. :)
> FYI, the JDK DateTime API will support the new era (in upcoming updates).
> [https://blogs.oracle.com/java-platform-group/a-new-japanese-era-for-java]
> The patch is available here:
> [https://github.com/apache/lucene-solr/pull/632]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Assigned] (LUCENE-7992) Kuromoji fails with UnsupportedOperationException in case of duplicate keys in the user dictionary

2017-10-12 Thread Christian Moen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian Moen reassigned LUCENE-7992:
--

Assignee: Christian Moen

> Kuromoji fails with UnsupportedOperationException in case of duplicate keys 
> in the user dictionary
> --
>
> Key: LUCENE-7992
> URL: https://issues.apache.org/jira/browse/LUCENE-7992
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Adrien Grand
>Assignee: Christian Moen
>Priority: Minor
>
> Failing is the right thing to do but the exception could clarify the source 
> of the problem. Today it just throws an UnsupportedOperationException with no 
> error message because of a call to PositiveIntOutputs.merge.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7992) Kuromoji fails with UnsupportedOperationException in case of duplicate keys in the user dictionary

2017-10-12 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16202098#comment-16202098
 ] 

Christian Moen commented on LUCENE-7992:


Thanks, Adrien.  I'll have a look. 

> Kuromoji fails with UnsupportedOperationException in case of duplicate keys 
> in the user dictionary
> --
>
> Key: LUCENE-7992
> URL: https://issues.apache.org/jira/browse/LUCENE-7992
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Adrien Grand
>Priority: Minor
>
> Failing is the right thing to do but the exception could clarify the source 
> of the problem. Today it just throws an UnsupportedOperationException with no 
> error message because of a call to PositiveIntOutputs.merge.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Assigned] (LUCENE-7181) JapaneseTokenizer: Validate segmentation of User Dictionary entries on creation

2016-04-08 Thread Christian Moen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian Moen reassigned LUCENE-7181:
--

Assignee: Christian Moen

> JapaneseTokenizer: Validate segmentation of User Dictionary entries on 
> creation
> ---
>
> Key: LUCENE-7181
> URL: https://issues.apache.org/jira/browse/LUCENE-7181
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Tomás Fernández Löbbe
>Assignee: Christian Moen
> Attachments: LUCENE-7181.patch
>
>
> From the [conversation on the dev 
> list|http://mail-archives.apache.org/mod_mbox/lucene-dev/201604.mbox/%3CCAMJgJxR8gLnXi7WXkN3KFfxHu=posevxxarbbg+chce1tzh...@mail.gmail.com%3E]
> The user dictionary in the {{JapaneseTokenizer}} allows users to customize 
> how a stream is broken into tokens using a specific set of rules provided 
> like: 
> AABBBCC -> AA BBB CC
> It does not allow users to change any of the token characters like:
> (1) AABBBCC -> DD BBB CC   (this will just tokenize to "AA", "BBB", "CC"; it 
> seems to only care about positions) 
> It also doesn't let a character be part of more than one token, like:
> (2) AABBBCC -> AAB BBB BCC (this will throw an AIOOBE)
> ...or make the output token bigger than the input text: 
> (3) AA -> AAA (also an AIOOBE)
> Currently there is no validation for those cases: case 1 doesn't fail but 
> provides unexpected tokens; cases 2 and 3 fail when the input text is 
> analyzed. We should add validation to the {{UserDictionary}} creation.
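
For concreteness, the rules above correspond to entries in Kuromoji's user dictionary 
CSV format (surface form, space-separated segmentation, readings, part of speech). A 
minimal sketch of loading one valid entry; invalid segmentations of the kinds listed 
as cases 2 and 3 are currently only rejected later, at analysis time:

{noformat}
import java.io.StringReader;

import org.apache.lucene.analysis.ja.JapaneseTokenizer;
import org.apache.lucene.analysis.ja.dict.UserDictionary;

public class UserDictionaryExample {
  public static void main(String[] args) throws Exception {
    // Valid entry: the space-separated segmentation re-spells the surface form exactly.
    String entries = "関西国際空港,関西 国際 空港,カンサイ コクサイ クウコウ,カスタム名詞\n";
    UserDictionary userDict = UserDictionary.open(new StringReader(entries));
    // This issue proposes validating the segmentation here, in UserDictionary,
    // instead of failing with an AIOOBE once the input text is analyzed.
    JapaneseTokenizer tokenizer =
        new JapaneseTokenizer(userDict, true, JapaneseTokenizer.Mode.SEARCH);
  }
}
{noformat}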



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6837) Add N-best output capability to JapaneseTokenizer

2016-01-11 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15093096#comment-15093096
 ] 

Christian Moen commented on LUCENE-6837:


Hello Mike,

Yes, I'd like to backport this to 5.5.

> Add N-best output capability to JapaneseTokenizer
> -
>
> Key: LUCENE-6837
> URL: https://issues.apache.org/jira/browse/LUCENE-6837
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Affects Versions: 5.3
>Reporter: KONNO, Hiroharu
>Assignee: Christian Moen
>Priority: Minor
> Attachments: LUCENE-6837 for 5.4.zip, LUCENE-6837.patch, 
> LUCENE-6837.patch, LUCENE-6837.patch, LUCENE-6837.patch, LUCENE-6837.patch
>
>
> Japanese morphological analyzers often generate mis-segmented tokens. N-best 
> output reduces the impact of mis-segmentation on search results. N-best output 
> is more meaningful than character N-grams, and it increases the hit count too.
> If you use N-best output, you can get decompounded tokens (e.g. 
> "シニアソフトウェアエンジニア" => {"シニア", "シニアソフトウェアエンジニア", "ソフトウェア", "エンジニア"}) and 
> overlapping tokens (e.g. "数学部長谷川" => {"数学", "部", "部長", "長谷川", "谷川"}), 
> depending on the dictionary and N-best parameter settings.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6837) Add N-best output capability to JapaneseTokenizer

2015-11-27 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15029762#comment-15029762
 ] 

Christian Moen commented on LUCENE-6837:


Thanks a lot, Konno-san.  Things look good.  My apologies that I couldn't look 
into this earlier.

I've attached a new patch where I've included your fix and also renamed some 
methods.  I think it's getting ready...


> Add N-best output capability to JapaneseTokenizer
> -
>
> Key: LUCENE-6837
> URL: https://issues.apache.org/jira/browse/LUCENE-6837
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Affects Versions: 5.3
>Reporter: KONNO, Hiroharu
>Assignee: Christian Moen
>Priority: Minor
> Attachments: LUCENE-6837.patch, LUCENE-6837.patch, LUCENE-6837.patch, 
> LUCENE-6837.patch, LUCENE-6837.patch
>
>
> Japanese morphological analyzers often generate mis-segmented tokens. N-best 
> output reduces the impact of mis-segmentation on search results. N-best output 
> is more meaningful than character N-grams, and it increases the hit count too.
> If you use N-best output, you can get decompounded tokens (e.g. 
> "シニアソフトウェアエンジニア" => {"シニア", "シニアソフトウェアエンジニア", "ソフトウェア", "エンジニア"}) and 
> overlapping tokens (e.g. "数学部長谷川" => {"数学", "部", "部長", "長谷川", "谷川"}), 
> depending on the dictionary and N-best parameter settings.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-6837) Add N-best output capability to JapaneseTokenizer

2015-11-27 Thread Christian Moen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian Moen updated LUCENE-6837:
---
Attachment: LUCENE-6837.patch

> Add N-best output capability to JapaneseTokenizer
> -
>
> Key: LUCENE-6837
> URL: https://issues.apache.org/jira/browse/LUCENE-6837
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Affects Versions: 5.3
>Reporter: KONNO, Hiroharu
>Assignee: Christian Moen
>Priority: Minor
> Attachments: LUCENE-6837.patch, LUCENE-6837.patch, LUCENE-6837.patch, 
> LUCENE-6837.patch, LUCENE-6837.patch
>
>
> Japanese morphological analyzers often generate mis-segmented tokens. N-best 
> output reduces the impact of mis-segmentation on search results. N-best output 
> is more meaningful than character N-grams, and it increases the hit count too.
> If you use N-best output, you can get decompounded tokens (e.g. 
> "シニアソフトウェアエンジニア" => {"シニア", "シニアソフトウェアエンジニア", "ソフトウェア", "エンジニア"}) and 
> overlapping tokens (e.g. "数学部長谷川" => {"数学", "部", "部長", "長谷川", "谷川"}), 
> depending on the dictionary and N-best parameter settings.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6837) Add N-best output capability to JapaneseTokenizer

2015-11-18 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15010673#comment-15010673
 ] 

Christian Moen commented on LUCENE-6837:


Tokenizing Japanese Wikipedia seems fine with nBestCost set, but it seems like 
random-blasting doesn't pass.

Konno-san, I'm wondering if I can trouble you with looking into why 
{{testRandomHugeStrings}} fails with the latest patch?

The test basically does random-blasting with nBestCost set to 2000.  I think 
it's a good idea to fix this before we commit.  I believe it's easily 
reproducible, but I used

{noformat}
ant test  -Dtestcase=TestJapaneseTokenizer -Dtests.method=testRandomHugeStrings 
-Dtests.seed=99EB179B92E66345 -Dtests.slow=true -Dtests.locale=sr_CS 
-Dtests.timezone=PNT -Dtests.asserts=true -Dtests.file.encoding=US-ASCII
{noformat}

in my environment.

> Add N-best output capability to JapaneseTokenizer
> -
>
> Key: LUCENE-6837
> URL: https://issues.apache.org/jira/browse/LUCENE-6837
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Affects Versions: 5.3
>Reporter: KONNO, Hiroharu
>Assignee: Christian Moen
>Priority: Minor
> Attachments: LUCENE-6837.patch, LUCENE-6837.patch, LUCENE-6837.patch
>
>
> Japanese morphological analyzers often generate mis-segmented tokens. N-best 
> output reduces the impact of mis-segmentation on search results. N-best output 
> is more meaningful than character N-grams, and it increases the hit count too.
> If you use N-best output, you can get decompounded tokens (e.g. 
> "シニアソフトウェアエンジニア" => {"シニア", "シニアソフトウェアエンジニア", "ソフトウェア", "エンジニア"}) and 
> overlapping tokens (e.g. "数学部長谷川" => {"数学", "部", "部長", "長谷川", "谷川"}), 
> depending on the dictionary and N-best parameter settings.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-6837) Add N-best output capability to JapaneseTokenizer

2015-11-18 Thread Christian Moen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian Moen updated LUCENE-6837:
---
Attachment: LUCENE-6837.patch

> Add N-best output capability to JapaneseTokenizer
> -
>
> Key: LUCENE-6837
> URL: https://issues.apache.org/jira/browse/LUCENE-6837
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Affects Versions: 5.3
>Reporter: KONNO, Hiroharu
>Assignee: Christian Moen
>Priority: Minor
> Attachments: LUCENE-6837.patch, LUCENE-6837.patch, LUCENE-6837.patch
>
>
> Japanese morphological analyzers often generate mis-segmented tokens. N-best 
> output reduces the impact of mis-segmentation on search results. N-best output 
> is more meaningful than character N-grams, and it increases the hit count too.
> If you use N-best output, you can get decompounded tokens (e.g. 
> "シニアソフトウェアエンジニア" => {"シニア", "シニアソフトウェアエンジニア", "ソフトウェア", "エンジニア"}) and 
> overlapping tokens (e.g. "数学部長谷川" => {"数学", "部", "部長", "長谷川", "谷川"}), 
> depending on the dictionary and N-best parameter settings.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6837) Add N-best output capability to JapaneseTokenizer

2015-11-08 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14995604#comment-14995604
 ] 

Christian Moen commented on LUCENE-6837:


I've attached a new patch with some minor changes:

* Made the {{System.out.printf}} calls subject to VERBOSE being true
* Introduced RuntimeException to deal with the initialization error cases
* Renamed the new parameters to {{nBestCost}} and {{nBestExamples}}
* Added additional javadoc here and there to document the new functionality

I'm planning on running some stability tests with the new tokenizer parameters 
next.
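
As a quick illustration of how the renamed {{nBestCost}} parameter is meant to be 
used (a sketch only; the {{setNBestCost}} setter name here mirrors the parameter 
name above and the setter that eventually shipped with this feature):

{noformat}
import java.io.StringReader;

import org.apache.lucene.analysis.ja.JapaneseTokenizer;

public class NBestExample {
  public static void main(String[] args) throws Exception {
    JapaneseTokenizer tokenizer =
        new JapaneseTokenizer(null, true, JapaneseTokenizer.Mode.SEARCH);
    // Emit extra tokens from lattice paths whose cost stays within 2000
    // of the best path, in addition to the normal best-path output.
    tokenizer.setNBestCost(2000);
    tokenizer.setReader(new StringReader("シニアソフトウェアエンジニア"));
    // ... consume the stream as usual (reset / incrementToken / end / close).
  }
}
{noformat}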

> Add N-best output capability to JapaneseTokenizer
> -
>
> Key: LUCENE-6837
> URL: https://issues.apache.org/jira/browse/LUCENE-6837
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Affects Versions: 5.3
>Reporter: KONNO, Hiroharu
>Assignee: Christian Moen
>Priority: Minor
> Attachments: LUCENE-6837.patch, LUCENE-6837.patch
>
>
> Japanese morphological analyzers often generate mis-segmented tokens. N-best 
> output reduces the impact of mis-segmentation on search results. N-best output 
> is more meaningful than character N-grams, and it increases the hit count too.
> If you use N-best output, you can get decompounded tokens (e.g. 
> "シニアソフトウェアエンジニア" => {"シニア", "シニアソフトウェアエンジニア", "ソフトウェア", "エンジニア"}) and 
> overlapping tokens (e.g. "数学部長谷川" => {"数学", "部", "部長", "長谷川", "谷川"}), 
> depending on the dictionary and N-best parameter settings.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-6837) Add N-best output capability to JapaneseTokenizer

2015-11-08 Thread Christian Moen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian Moen updated LUCENE-6837:
---
Attachment: LUCENE-6837.patch

> Add N-best output capability to JapaneseTokenizer
> -
>
> Key: LUCENE-6837
> URL: https://issues.apache.org/jira/browse/LUCENE-6837
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Affects Versions: 5.3
>Reporter: KONNO, Hiroharu
>Assignee: Christian Moen
>Priority: Minor
> Attachments: LUCENE-6837.patch, LUCENE-6837.patch
>
>
> Japanese morphological analyzers often generate mis-segmented tokens. N-best 
> output reduces the impact of mis-segmentation on search results. N-best output 
> is more meaningful than character N-grams, and it increases the hit count too.
> If you use N-best output, you can get decompounded tokens (e.g. 
> "シニアソフトウェアエンジニア" => {"シニア", "シニアソフトウェアエンジニア", "ソフトウェア", "エンジニア"}) and 
> overlapping tokens (e.g. "数学部長谷川" => {"数学", "部", "部長", "長谷川", "谷川"}), 
> depending on the dictionary and N-best parameter settings.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Assigned] (LUCENE-6837) Add N-best output capability to JapaneseTokenizer

2015-10-28 Thread Christian Moen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian Moen reassigned LUCENE-6837:
--

Assignee: Christian Moen

> Add N-best output capability to JapaneseTokenizer
> -
>
> Key: LUCENE-6837
> URL: https://issues.apache.org/jira/browse/LUCENE-6837
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Affects Versions: 5.3
>Reporter: KONNO, Hiroharu
>Assignee: Christian Moen
>Priority: Minor
> Attachments: LUCENE-6837.patch
>
>
> Japanese morphological analyzers often generate mis-segmented tokens. N-best 
> output reduces the impact of mis-segmentation on search results. N-best output 
> is more meaningful than character N-grams, and it increases the hit count too.
> If you use N-best output, you can get decompounded tokens (e.g. 
> "シニアソフトウェアエンジニア" => {"シニア", "シニアソフトウェアエンジニア", "ソフトウェア", "エンジニア"}) and 
> overlapping tokens (e.g. "数学部長谷川" => {"数学", "部", "部長", "長谷川", "谷川"}), 
> depending on the dictionary and N-best parameter settings.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6837) Add N-best output capability to JapaneseTokenizer

2015-10-28 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978156#comment-14978156
 ] 

Christian Moen commented on LUCENE-6837:


Thanks a lot for this, Konno-san.  Very nice work!  I like the idea of 
calculating the n-best cost from examples.

Since search mode and extended mode solve a similar problem, I'm 
wondering if it makes sense to introduce n-best as a separate mode in itself.  
In your experience developing the feature, do you think it makes a lot of 
sense to use it with the search and extended modes?

I think I'm in favour of supporting it for all the modes, even though it 
perhaps makes the most sense for normal mode.  The reason for this is to make 
sure that the entire API for {{JapaneseTokenizer}} is functional for all the 
tokenizer modes.

I'll add a few tests and I'd like to commit this soon.

> Add N-best output capability to JapaneseTokenizer
> -
>
> Key: LUCENE-6837
> URL: https://issues.apache.org/jira/browse/LUCENE-6837
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Affects Versions: 5.3
>Reporter: KONNO, Hiroharu
>Priority: Minor
> Attachments: LUCENE-6837.patch
>
>
> Japanese morphological analyzers often generate mis-segmented tokens. N-best 
> output reduces the impact of mis-segmentation on search results. N-best output 
> is more meaningful than character N-grams, and it increases the hit count too.
> If you use N-best output, you can get decompounded tokens (e.g. 
> "シニアソフトウェアエンジニア" => {"シニア", "シニアソフトウェアエンジニア", "ソフトウェア", "エンジニア"}) and 
> overlapping tokens (e.g. "数学部長谷川" => {"数学", "部", "部長", "長谷川", "谷川"}), 
> depending on the dictionary and N-best parameter settings.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6837) Add N-best output capability to JapaneseTokenizer

2015-10-12 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14954487#comment-14954487
 ] 

Christian Moen commented on LUCENE-6837:


Thanks.  I've had a very quick look at the code and have some comments and 
questions.  I'm happy to take care of this, Koji.


> Add N-best output capability to JapaneseTokenizer
> -
>
> Key: LUCENE-6837
> URL: https://issues.apache.org/jira/browse/LUCENE-6837
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Affects Versions: 5.3
>Reporter: KONNO, Hiroharu
>Priority: Minor
> Attachments: LUCENE-6837.patch
>
>
> Japanese morphological analyzers often generate mis-segmented tokens. N-best 
> output reduces the impact of mis-segmentation on search results. N-best output 
> is more meaningful than character N-grams, and it increases the hit count too.
> If you use N-best output, you can get decompounded tokens (e.g. 
> "シニアソフトウェアエンジニア" => {"シニア", "シニアソフトウェアエンジニア", "ソフトウェア", "エンジニア"}) and 
> overlapping tokens (e.g. "数学部長谷川" => {"数学", "部", "部長", "長谷川", "谷川"}), 
> depending on the dictionary and N-best parameter settings.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6733) Incorrect URL causes build break - analysis/kuromoji

2015-08-11 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692824#comment-14692824
 ] 

Christian Moen commented on LUCENE-6733:


Thanks, I'll have a look.

> Incorrect URL causes build break - analysis/kuromoji
> 
>
> Key: LUCENE-6733
> URL: https://issues.apache.org/jira/browse/LUCENE-6733
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: general/build, modules/analysis
>Affects Versions: 5.2.1
> Environment: n/a
>Reporter: Susumu Fukuda
>Priority: Minor
> Attachments: LUCENE-6733.patch
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Ivy.xml contains the dictionary URLs for both IPADIC and NAIST-JDIC.
> But they're already gone - they no longer exist - so the download-dict task 
> causes a build break.
> Google Code will be closed soon, and SourceForge (.jp, not .net) has moved to 
> osdn.jp.
> Hmm… not sure how I can attach a patch file… I can't find a field. Later?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-6468) Empty kuromoji user dictionary -> NPE

2015-05-11 Thread Christian Moen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian Moen resolved LUCENE-6468.

   Resolution: Fixed
Fix Version/s: 5.x
   Trunk

> Empty kuromoji user dictionary -> NPE
> -
>
> Key: LUCENE-6468
> URL: https://issues.apache.org/jira/browse/LUCENE-6468
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Robert Muir
>Assignee: Christian Moen
> Fix For: Trunk, 5.x
>
> Attachments: LUCENE-6468.patch
>
>
> The Kuromoji user dictionary takes a Reader and allows comments and other lines 
> to be ignored. But if it's "empty" in the sense of having no actual entries, the 
> returned FST will be null, and it will throw a confusing NPE.
> The JapaneseTokenizer and JapaneseAnalyzer APIs already treat a null UserDictionary 
> as having none at all, so I think the best fix is to change the UserDictionary 
> API from UserDictionary(Reader) to UserDictionary.open(Reader) or similar, 
> and return null if the FST is empty.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6468) Empty kuromoji user dictionary -> NPE

2015-05-11 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14537767#comment-14537767
 ] 

Christian Moen commented on LUCENE-6468:


Thanks, Ohtani-san!

I added a {{final}} that is required on {{branch_5x}} for JDK 1.7, and also 
changed the empty user dictionary test to contain a user dictionary with a 
comment and some newlines (it's still empty, though).

I've committed your patch to {{trunk}} and {{branch_5x}}.
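
For reference, the kind of "empty" user dictionary the updated test covers, and the 
behaviour the fix gives it (a sketch using the {{UserDictionary.open()}} API this 
change introduced):

{noformat}
import java.io.StringReader;

import org.apache.lucene.analysis.ja.dict.UserDictionary;

public class EmptyUserDictionaryExample {
  public static void main(String[] args) throws Exception {
    // Only a comment and some blank lines - no actual entries.
    String input = "# empty user dictionary\n\n\n";
    UserDictionary userDict = UserDictionary.open(new StringReader(input));
    // With the fix, open() returns null instead of building an FST that would
    // later trigger a confusing NPE; callers already accept a null dictionary.
    System.out.println(userDict == null);  // prints "true"
  }
}
{noformat}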


> Empty kuromoji user dictionary -> NPE
> -
>
> Key: LUCENE-6468
> URL: https://issues.apache.org/jira/browse/LUCENE-6468
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Robert Muir
>Assignee: Christian Moen
> Attachments: LUCENE-6468.patch
>
>
> The Kuromoji user dictionary takes a Reader and allows comments and other lines 
> to be ignored. But if it's "empty" in the sense of having no actual entries, the 
> returned FST will be null, and it will throw a confusing NPE.
> The JapaneseTokenizer and JapaneseAnalyzer APIs already treat a null UserDictionary 
> as having none at all, so I think the best fix is to change the UserDictionary 
> API from UserDictionary(Reader) to UserDictionary.open(Reader) or similar, 
> and return null if the FST is empty.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Assigned] (LUCENE-6468) Empty kuromoji user dictionary -> NPE

2015-05-10 Thread Christian Moen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian Moen reassigned LUCENE-6468:
--

Assignee: Christian Moen

> Empty kuromoji user dictionary -> NPE
> -
>
> Key: LUCENE-6468
> URL: https://issues.apache.org/jira/browse/LUCENE-6468
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Robert Muir
>Assignee: Christian Moen
> Attachments: LUCENE-6468.patch
>
>
> The Kuromoji user dictionary takes a Reader and allows comments and other lines 
> to be ignored. But if it's "empty" in the sense of having no actual entries, the 
> returned FST will be null, and it will throw a confusing NPE.
> The JapaneseTokenizer and JapaneseAnalyzer APIs already treat a null UserDictionary 
> as having none at all, so I think the best fix is to change the UserDictionary 
> API from UserDictionary(Reader) to UserDictionary.open(Reader) or similar, 
> and return null if the FST is empty.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6468) Empty kuromoji user dictionary -> NPE

2015-05-07 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14532226#comment-14532226
 ] 

Christian Moen commented on LUCENE-6468:


Good catch.  I can look into a patch for this.

> Empty kuromoji user dictionary -> NPE
> -
>
> Key: LUCENE-6468
> URL: https://issues.apache.org/jira/browse/LUCENE-6468
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Robert Muir
>
> The Kuromoji user dictionary takes a Reader and allows comments and other lines 
> to be ignored. But if it's "empty" in the sense of having no actual entries, the 
> returned FST will be null, and it will throw a confusing NPE.
> The JapaneseTokenizer and JapaneseAnalyzer APIs already treat a null UserDictionary 
> as having none at all, so I think the best fix is to change the UserDictionary 
> API from UserDictionary(Reader) to UserDictionary.open(Reader) or similar, 
> and return null if the FST is empty.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6216) Make it easier to modify Japanese token attributes downstream

2015-02-03 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304342#comment-14304342
 ] 

Christian Moen commented on LUCENE-6216:


Thanks, Robert.

I had the same idea and tried this out last night.  The advantage of the 
approach is that we only read the buffer data for the token attributes we use, 
but it leaves the API slightly awkward in my opinion, since we would have 
both a {{setToken()}} and a {{setPartOfSpeech()}}.  That said, this is still 
perhaps the best way to go for performance reasons, as these APIs are very 
low-level and not commonly used.

For the sake of exploring an alternative idea: a different approach could be to 
have separate token filters set these attributes.  The tokenizer would set a 
{{CharTermAttribute}}, etc. and a {{JapaneseTokenAttribute}} (or something 
suitably named) that holds the {{Token}}.  A separate 
{{JapanesePartOfSpeechFilter}} would be responsible for setting the 
{{PartOfSpeechAttribute}} by getting the data from the 
{{JapaneseTokenAttribute}} using a {{getToken()}} method. We'd still need logic 
similar to the above to deal with {{setPartOfSpeech()}}, etc. so I don't think 
we gain anything by taking this approach, and it's a big change, too.
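
To make the two API shapes being compared concrete, here is a purely illustrative 
sketch; the existing attribute only has {{getPartOfSpeech()}} and {{setToken()}}, and 
{{setPartOfSpeech()}} is the kind of direct setter being discussed, not existing code:

{noformat}
import org.apache.lucene.analysis.ja.Token;
import org.apache.lucene.util.Attribute;

public interface PartOfSpeechAttribute extends Attribute {
  String getPartOfSpeech();
  void setToken(Token token);                 // current style: value derived lazily from the Token
  void setPartOfSpeech(String partOfSpeech);  // discussed style: set the value directly downstream
}
{noformat}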

> Make it easier to modify Japanese token attributes downstream
> -
>
> Key: LUCENE-6216
> URL: https://issues.apache.org/jira/browse/LUCENE-6216
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Christian Moen
>Priority: Minor
>
> Japanese-specific token attributes such as {{PartOfSpeechAttribute}}, 
> {{BaseFormAttribute}}, etc. get their values from a 
> {{org.apache.lucene.analysis.ja.Token}} through a {{setToken()}} method.  
> This makes it cumbersome to change these token attributes later on in the 
> analysis chain since the {{Token}} instances are difficult to instantiate 
> (sort of read-only objects).
> I've run into this issue in LUCENE-3922 (JapaneseNumberFilter), where it would 
> be appropriate to update token attributes to also reflect Japanese number 
> normalization.
> I think it might be more practical to allow setting a specific value for 
> these token attributes directly rather than through a {{Token}}, since it 
> makes the APIs simpler, makes it easier to change attributes downstream, and 
> also makes supporting additional dictionaries easier.
> The drawback with the approach that I can think of is a performance hit as we 
> will miss out on the inherent lazy retrieval of these token attributes from 
> the {{Token}} object (and the underlying dictionary/buffer).
> I'd like to do some testing to better understand the performance impact of 
> this change. Happy to hear your thoughts on this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-6216) Make it easier to modify Japanese token attributes downstream

2015-02-03 Thread Christian Moen (JIRA)
Christian Moen created LUCENE-6216:
--

 Summary: Make it easier to modify Japanese token attributes 
downstream
 Key: LUCENE-6216
 URL: https://issues.apache.org/jira/browse/LUCENE-6216
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/analysis
Reporter: Christian Moen
Priority: Minor


Japanese-specific token attributes such as {{PartOfSpeechAttribute}}, 
{{BaseFormAttribute}}, etc. get their values from a 
{{org.apache.lucene.analysis.ja.Token}} through a {{setToken()}} method.  This 
makes it cumbersome to change these token attributes later on in the analysis 
chain since the {{Token}} instances are difficult to instantiate (sort of 
read-only objects).

I've run into this issue in LUCENE-3922 (JapaneseNumberFilter), where it would 
be appropriate to update token attributes to also reflect Japanese number 
normalization.

I think it might be more practical to allow setting a specific value for these 
token attributes directly rather than through a {{Token}}, since it makes the 
APIs simpler, makes it easier to change attributes downstream, and also makes 
supporting additional dictionaries easier.

The drawback with the approach that I can think of is a performance hit as we 
will miss out on the inherent lazy retrieval of these token attributes from the 
{{Token}} object (and the underlying dictionary/buffer).

I'd like to do some testing to better understand the performance impact of this 
change. Happy to hear your thoughts on this.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji

2015-02-03 Thread Christian Moen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian Moen updated LUCENE-3922:
---
Attachment: LUCENE-3922.patch

Minor updates to javadoc.

I'll leave reading attributes, etc. unchanged for now and get back to resolving 
this once we have better mechanisms in place for updating some of the Japanese 
token attributes downstream.

> Add Japanese Kanji number normalization to Kuromoji
> ---
>
> Key: LUCENE-3922
> URL: https://issues.apache.org/jira/browse/LUCENE-3922
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Affects Versions: 4.0-ALPHA
>Reporter: Kazuaki Hiraga
>Assignee: Christian Moen
>  Labels: features
> Fix For: 5.1
>
> Attachments: LUCENE-3922.patch, LUCENE-3922.patch, LUCENE-3922.patch, 
> LUCENE-3922.patch, LUCENE-3922.patch, LUCENE-3922.patch, LUCENE-3922.patch, 
> LUCENE-3922.patch
>
>
> Japanese people use Kanji numerals instead of Arabic numerals for writing 
> prices, addresses and so on, e.g. 12万4800円 (124,800 JPY), 二番町三ノ二 (3-2 Nibancho) and 
> 十二月 (December).  So, we would like to normalize those Kanji numerals to Arabic 
> numerals (I don't think we need to have a capability to normalize to Kanji 
> numerals).
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji

2015-02-02 Thread Christian Moen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian Moen updated LUCENE-3922:
---
Attachment: LUCENE-3922.patch

Updated patch with decimal number support and additional javadoc; the test code 
now makes precommit happy.

Token attributes such as part-of-speech, readings, etc. for the normalized 
token are currently inherited from the last token used when composing the 
normalized number. Since these values are likely to be wrong, I'm inclined to 
set these attributes to null or a reasonable default.

I'm very happy to hear your thoughts on this.



> Add Japanese Kanji number normalization to Kuromoji
> ---
>
> Key: LUCENE-3922
> URL: https://issues.apache.org/jira/browse/LUCENE-3922
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Affects Versions: 4.0-ALPHA
>Reporter: Kazuaki Hiraga
>Assignee: Christian Moen
>  Labels: features
> Fix For: 5.1
>
> Attachments: LUCENE-3922.patch, LUCENE-3922.patch, LUCENE-3922.patch, 
> LUCENE-3922.patch, LUCENE-3922.patch, LUCENE-3922.patch, LUCENE-3922.patch
>
>
> Japanese people use Kanji numerals instead of Arabic numerals for writing 
> prices, addresses and so on, e.g. 12万4800円 (124,800 JPY), 二番町三ノ二 (3-2 Nibancho) and 
> 十二月 (December).  So, we would like to normalize those Kanji numerals to Arabic 
> numerals (I don't think we need to have a capability to normalize to Kanji 
> numerals).
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji

2015-02-02 Thread Christian Moen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian Moen updated LUCENE-3922:
---
Fix Version/s: 5.1

> Add Japanese Kanji number normalization to Kuromoji
> ---
>
> Key: LUCENE-3922
> URL: https://issues.apache.org/jira/browse/LUCENE-3922
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Affects Versions: 4.0-ALPHA
>Reporter: Kazuaki Hiraga
>Assignee: Christian Moen
>  Labels: features
> Fix For: 5.1
>
> Attachments: LUCENE-3922.patch, LUCENE-3922.patch, LUCENE-3922.patch, 
> LUCENE-3922.patch, LUCENE-3922.patch, LUCENE-3922.patch
>
>
> Japanese people use Kanji numerals instead of Arabic numerals for writing 
> prices, addresses and so on, e.g. 12万4800円 (124,800 JPY), 二番町三ノ二 (3-2 Nibancho) and 
> 十二月 (December).  So, we would like to normalize those Kanji numerals to Arabic 
> numerals (I don't think we need to have a capability to normalize to Kanji 
> numerals).
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji

2015-01-28 Thread Christian Moen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian Moen updated LUCENE-3922:
---
Attachment: LUCENE-3922.patch

New patch with CHANGES.txt and services entry.

Will do some end-to-end testing next.

> Add Japanese Kanji number normalization to Kuromoji
> ---
>
> Key: LUCENE-3922
> URL: https://issues.apache.org/jira/browse/LUCENE-3922
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Affects Versions: 4.0-ALPHA
>Reporter: Kazuaki Hiraga
>Assignee: Christian Moen
>  Labels: features
> Attachments: LUCENE-3922.patch, LUCENE-3922.patch, LUCENE-3922.patch, 
> LUCENE-3922.patch, LUCENE-3922.patch, LUCENE-3922.patch
>
>
> Japanese people use Kanji numerals instead of Arabic numerals for writing 
> prices, addresses and so on, e.g. 12万4800円 (124,800 JPY), 二番町三ノ二 (3-2 Nibancho) and 
> 十二月 (December).  So, we would like to normalize those Kanji numerals to Arabic 
> numerals (I don't think we need to have a capability to normalize to Kanji 
> numerals).
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji

2015-01-28 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296379#comment-14296379
 ] 

Christian Moen commented on LUCENE-3922:


Please feel free to test it.  Feedback is very welcome.

The patch is against {{trunk}} and this should make it into 5.1.

> Add Japanese Kanji number normalization to Kuromoji
> ---
>
> Key: LUCENE-3922
> URL: https://issues.apache.org/jira/browse/LUCENE-3922
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Affects Versions: 4.0-ALPHA
>Reporter: Kazuaki Hiraga
>Assignee: Christian Moen
>  Labels: features
> Attachments: LUCENE-3922.patch, LUCENE-3922.patch, LUCENE-3922.patch, 
> LUCENE-3922.patch, LUCENE-3922.patch
>
>
> Japanese people use Kanji numerals instead of Arabic numerals for writing 
> prices, addresses and so on, e.g. 12万4800円 (124,800 JPY), 二番町三ノ二 (3-2 Nibancho) and 
> 十二月 (December).  So, we would like to normalize those Kanji numerals to Arabic 
> numerals (I don't think we need to have a capability to normalize to Kanji 
> numerals).
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji

2015-01-21 Thread Christian Moen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian Moen updated LUCENE-3922:
---
Attachment: LUCENE-3922.patch

Added factory and wrote javadoc.

> Add Japanese Kanji number normalization to Kuromoji
> ---
>
> Key: LUCENE-3922
> URL: https://issues.apache.org/jira/browse/LUCENE-3922
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Affects Versions: 4.0-ALPHA
>Reporter: Kazuaki Hiraga
>Assignee: Christian Moen
>  Labels: features
> Attachments: LUCENE-3922.patch, LUCENE-3922.patch, LUCENE-3922.patch, 
> LUCENE-3922.patch, LUCENE-3922.patch
>
>
> Japanese people use Kanji numerals instead of Arabic numerals for writing 
> prices, addresses and so on, e.g. 12万4800円 (124,800 JPY), 二番町三ノ二 (3-2 Nibancho) and 
> 十二月 (December).  So, we would like to normalize those Kanji numerals to Arabic 
> numerals (I don't think we need to have a capability to normalize to Kanji 
> numerals).
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji

2014-10-16 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173567#comment-14173567
 ] 

Christian Moen commented on LUCENE-3922:


Gaute and I have done testing on real-world data, and we've uncovered and 
fixed a couple of corner-case issues.

Our todo items are as follows:

# Do additional testing and possibly add additional number formats
# Document some unsupported cases in unit-tests
# Add class-level javadoc
# Add a Solr factory



> Add Japanese Kanji number normalization to Kuromoji
> ---
>
> Key: LUCENE-3922
> URL: https://issues.apache.org/jira/browse/LUCENE-3922
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Affects Versions: 4.0-ALPHA
>Reporter: Kazuaki Hiraga
>Assignee: Christian Moen
>  Labels: features
> Attachments: LUCENE-3922.patch, LUCENE-3922.patch, LUCENE-3922.patch, 
> LUCENE-3922.patch
>
>
> Japanese people use Kanji numerals instead of Arabic numerals for writing 
> prices, addresses and so on, e.g. 12万4800円 (124,800 JPY), 二番町三ノ二 (3-2 Nibancho) and 
> 十二月 (December).  So, we would like to normalize those Kanji numerals to Arabic 
> numerals (I don't think we need to have a capability to normalize to Kanji 
> numerals).
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji

2014-10-16 Thread Christian Moen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian Moen updated LUCENE-3922:
---
Attachment: LUCENE-3922.patch

> Add Japanese Kanji number normalization to Kuromoji
> ---
>
> Key: LUCENE-3922
> URL: https://issues.apache.org/jira/browse/LUCENE-3922
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Affects Versions: 4.0-ALPHA
>Reporter: Kazuaki Hiraga
>Assignee: Christian Moen
>  Labels: features
> Attachments: LUCENE-3922.patch, LUCENE-3922.patch, LUCENE-3922.patch, 
> LUCENE-3922.patch
>
>
> Japanese people use Kanji numerals instead of Arabic numerals for writing 
> prices, addresses and so on, e.g. 12万4800円 (124,800 JPY), 二番町三ノ二 (3-2 Nibancho) and 
> 十二月 (December).  So, we would like to normalize those Kanji numerals to Arabic 
> numerals (I don't think we need to have a capability to normalize to Kanji 
> numerals).
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji

2014-10-09 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14164954#comment-14164954
 ] 

Christian Moen commented on LUCENE-3922:


I've attached a new patch.

The {{checkRandomData}} issues were caused by improper handling of token 
composition for graphs (bug found by [~gaute]). Tokens preceded by a 
position-increment-zero token are left untouched, and so are stacked/synonym tokens.

We'll do some more testing and add some documentation before we move forward to 
commit this.

> Add Japanese Kanji number normalization to Kuromoji
> ---
>
> Key: LUCENE-3922
> URL: https://issues.apache.org/jira/browse/LUCENE-3922
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Affects Versions: 4.0-ALPHA
>Reporter: Kazuaki Hiraga
>Assignee: Christian Moen
>  Labels: features
> Attachments: LUCENE-3922.patch, LUCENE-3922.patch, LUCENE-3922.patch
>
>
> Japanese people use Kanji numerals instead of Arabic numerals for writing 
> prices, addresses and so on, e.g. 12万4800円 (124,800 JPY), 二番町三ノ二 (3-2 Nibancho) and 
> 十二月 (December).  So, we would like to normalize those Kanji numerals to Arabic 
> numerals (I don't think we need to have a capability to normalize to Kanji 
> numerals).
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji

2014-10-09 Thread Christian Moen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian Moen updated LUCENE-3922:
---
Attachment: LUCENE-3922.patch

> Add Japanese Kanji number normalization to Kuromoji
> ---
>
> Key: LUCENE-3922
> URL: https://issues.apache.org/jira/browse/LUCENE-3922
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Affects Versions: 4.0-ALPHA
>Reporter: Kazuaki Hiraga
>Assignee: Christian Moen
>  Labels: features
> Attachments: LUCENE-3922.patch, LUCENE-3922.patch, LUCENE-3922.patch
>
>
> Japanese people use Kanji numerals instead of Arabic numerals for writing 
> prices, addresses and so on, e.g. 12万4800円 (124,800 JPY), 二番町三ノ二 (3-2 Nibancho) and 
> 十二月 (December).  So, we would like to normalize those Kanji numerals to Arabic 
> numerals (I don't think we need to have a capability to normalize to Kanji 
> numerals).
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji

2014-08-05 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14085909#comment-14085909
 ] 

Christian Moen commented on LUCENE-3922:


Gaute and I have been doing some work on this, and we have rewritten it as a 
{{TokenFilter}}.

A few comments:

* We have added support for numbers such as 3.2兆円 as you requested, Kazu.
* We could potentially use a POS-tag attribute from Kuromoji to identify the 
numbers we are composing, but not relying on POS tags perhaps makes this filter 
useful also in the n-gramming case.
* We haven't implemented any of the anchoring logic discussed above, i.e. whether 
to restrict normalization to prices, etc. Is this useful to have?
* Input such as {{1,5}} becomes {{15}} after normalization, which may be 
undesirable. Is this bad input or do we want anchoring to retain these numbers?

One thing, though: in order to support some of this number parsing, i.e. cases 
such as 3.2兆円, we need to use Kuromoji in a mode that retains punctuation 
characters.

There's also an unresolved issue found by {{checkRandomData}} that we haven't 
tracked down and fixed, yet.

This is a work in progress and feedback is welcome.
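
To sketch roughly what this chain looks like (hedged: {{JapaneseNumberFilter}} is 
the class name the filter eventually got, and the example uses the later 
Reader-less {{Analyzer}} API for brevity):

{code:java}
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.ja.JapaneseNumberFilter;
import org.apache.lucene.analysis.ja.JapaneseTokenizer;

/** Sketch: number normalization needs the tokenizer to keep punctuation. */
public class KanjiNumberAnalyzerSketch extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    // discardPunctuation = false, so the decimal point in 3.2兆円 survives long
    // enough for the number filter to compose the whole value.
    Tokenizer tokenizer =
        new JapaneseTokenizer(null, /* discardPunctuation */ false, JapaneseTokenizer.Mode.SEARCH);
    TokenStream stream = new JapaneseNumberFilter(tokenizer);
    return new TokenStreamComponents(tokenizer, stream);
  }
}
{code}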

> Add Japanese Kanji number normalization to Kuromoji
> ---
>
> Key: LUCENE-3922
> URL: https://issues.apache.org/jira/browse/LUCENE-3922
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Affects Versions: 4.0-ALPHA
>Reporter: Kazuaki Hiraga
>  Labels: features
> Attachments: LUCENE-3922.patch, LUCENE-3922.patch
>
>
> Japanese people use Kanji numerals instead of Arabic numerals for writing 
> price, address and so on. i.e 12万4800円(124,800JPY), 二番町三ノ二(3-2 Nibancho) and 
> 十二月(December).  So, we would like to normalize those Kanji numerals to Arabic 
> numerals (I don't think we need to have a capability to normalize to Kanji 
> numerals).
>  



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji

2014-08-04 Thread Christian Moen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian Moen updated LUCENE-3922:
---

Attachment: LUCENE-3922.patch

> Add Japanese Kanji number normalization to Kuromoji
> ---
>
> Key: LUCENE-3922
> URL: https://issues.apache.org/jira/browse/LUCENE-3922
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Affects Versions: 4.0-ALPHA
>Reporter: Kazuaki Hiraga
>  Labels: features
> Attachments: LUCENE-3922.patch, LUCENE-3922.patch
>
>
> Japanese people use Kanji numerals instead of Arabic numerals for writing 
> price, address and so on. i.e 12万4800円(124,800JPY), 二番町三ノ二(3-2 Nibancho) and 
> 十二月(December).  So, we would like to normalize those Kanji numerals to Arabic 
> numerals (I don't think we need to have a capability to normalize to Kanji 
> numerals).
>  



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.

2014-02-13 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13901180#comment-13901180
 ] 

Christian Moen commented on SOLR-1301:
--

I've been reading through (pretty much all) the comments on this JIRA and I'd 
like to thank you all for the great effort you have put into this.

> Add a Solr contrib that allows for building Solr indexes via Hadoop's 
> Map-Reduce.
> -
>
> Key: SOLR-1301
> URL: https://issues.apache.org/jira/browse/SOLR-1301
> Project: Solr
>  Issue Type: New Feature
>Reporter: Andrzej Bialecki 
>Assignee: Mark Miller
> Fix For: 5.0, 4.7
>
> Attachments: README.txt, SOLR-1301-hadoop-0-20.patch, 
> SOLR-1301-hadoop-0-20.patch, SOLR-1301-maven-intellij.patch, SOLR-1301.patch, 
> SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
> SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
> SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
> SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
> SOLR-1301.patch, SolrRecordWriter.java, commons-logging-1.0.4.jar, 
> commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, 
> hadoop-0.20.1-core.jar, hadoop-core-0.20.2-cdh3u3.jar, hadoop.patch, 
> log4j-1.2.15.jar
>
>
> This patch contains  a contrib module that provides distributed indexing 
> (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is 
> twofold:
> * provide an API that is familiar to Hadoop developers, i.e. that of 
> OutputFormat
> * avoid unnecessary export and (de)serialization of data maintained on HDFS. 
> SolrOutputFormat consumes data produced by reduce tasks directly, without 
> storing it in intermediate files. Furthermore, by using an 
> EmbeddedSolrServer, the indexing task is split into as many parts as there 
> are reducers, and the data to be indexed is not sent over the network.
> Design
> --
> Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, 
> which in turn uses SolrRecordWriter to write this data. SolrRecordWriter 
> instantiates an EmbeddedSolrServer, and it also instantiates an 
> implementation of SolrDocumentConverter, which is responsible for turning 
> Hadoop (key, value) into a SolrInputDocument. This data is then added to a 
> batch, which is periodically submitted to EmbeddedSolrServer. When reduce 
> task completes, and the OutputFormat is closed, SolrRecordWriter calls 
> commit() and optimize() on the EmbeddedSolrServer.
> The API provides facilities to specify an arbitrary existing solr.home 
> directory, from which the conf/ and lib/ files will be taken.
> This process results in the creation of as many partial Solr home directories 
> as there were reduce tasks. The output shards are placed in the output 
> directory on the default filesystem (e.g. HDFS). Such part-N directories 
> can be used to run N shard servers. Additionally, users can specify the 
> number of reduce tasks, in particular 1 reduce task, in which case the output 
> will consist of a single shard.
> An example application is provided that processes large CSV files and uses 
> this API. It uses a custom CSV processing to avoid (de)serialization overhead.
> This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this 
> issue, you should put it in contrib/hadoop/lib.
> Note: the development of this patch was sponsored by an anonymous contributor 
> and approved for release under Apache License.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module

2013-11-12 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13820093#comment-13820093
 ] 

Christian Moen commented on LUCENE-2899:


bq. Stuff like this NER should NOT be in the analysis chain. as i said, its 
more useful in the "document build" phase anyway.

+1

Benson, as far as I understand, ES doesn't have the concept by design.

> Add OpenNLP Analysis capabilities as a module
> -
>
> Key: LUCENE-2899
> URL: https://issues.apache.org/jira/browse/LUCENE-2899
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Minor
> Fix For: 4.6
>
> Attachments: LUCENE-2899-RJN.patch, LUCENE-2899.patch, 
> OpenNLPFilter.java, OpenNLPTokenizer.java
>
>
> Now that OpenNLP is an ASF project and has a nice license, it would be nice 
> to have a submodule (under analysis) that exposed capabilities for it. Drew 
> Farris, Tom Morton and I have code that does:
> * Sentence Detection as a Tokenizer (could also be a TokenFilter, although it 
> would have to change slightly to buffer tokens)
> * NamedEntity recognition as a TokenFilter
> We are also planning a Tokenizer/TokenFilter that can put parts of speech as 
> either payloads (PartOfSpeechAttribute?) on a token or at the same position.
> I'd propose it go under:
> modules/analysis/opennlp



--
This message was sent by Atlassian JIRA
(v6.1#6144)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries

2013-10-16 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13796545#comment-13796545
 ] 

Christian Moen commented on LUCENE-4956:


SooMyung, I've committed the latest changes we merged in Seoul on Monday.  It's 
great if you can fix the decompounding issue we came across, for which we disabled 
a test.

Uwe, +1 to use {{Class#getResourceAsStream}} and remove {{FileUtils}} and 
{{JarResources}}.  I'll make these changes and commit to the branch.

Overall, I think there are a lot of things we can do to improve this code.  I 
would very much like to hear your opinion on what we should fix before committing 
to trunk, getting this onto the 4.x branch and improving from there.  My thinking 
is that it might be good to get this committed so we'll have Korean working even 
though the code needs some work.  SooMyung has a community in Korea that uses it, 
and it's serving their needs as far as I understand.

Happy to hear people's opinion on this.

> the korean analyzer that has a korean morphological analyzer and dictionaries
> -
>
> Key: LUCENE-4956
> URL: https://issues.apache.org/jira/browse/LUCENE-4956
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Affects Versions: 4.2
>Reporter: SooMyung Lee
>Assignee: Christian Moen
>  Labels: newbie
> Attachments: kr.analyzer.4x.tar, lucene-4956.patch, lucene4956.patch, 
> LUCENE-4956.patch
>
>
> Korean language has specific characteristic. When developing search service 
> with lucene & solr in korean, there are some problems in searching and 
> indexing. The korean analyer solved the problems with a korean morphological 
> anlyzer. It consists of a korean morphological analyzer, dictionaries, a 
> korean tokenizer and a korean filter. The korean anlyzer is made for lucene 
> and solr. If you develop a search service with lucene in korean, It is the 
> best idea to choose the korean analyzer.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries

2013-10-14 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13794046#comment-13794046
 ] 

Christian Moen commented on LUCENE-4956:


SooMyung and I met up in Seoul today and merged his latest changes locally.  I'll 
commit the changes to this branch when I'm back in Tokyo, and SooMyung will follow 
up with a fix for a known issue afterwards.  Hopefully we can commit this to trunk 
very soon.

> the korean analyzer that has a korean morphological analyzer and dictionaries
> -
>
> Key: LUCENE-4956
> URL: https://issues.apache.org/jira/browse/LUCENE-4956
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Affects Versions: 4.2
>Reporter: SooMyung Lee
>Assignee: Christian Moen
>  Labels: newbie
> Attachments: kr.analyzer.4x.tar, lucene-4956.patch, lucene4956.patch, 
> LUCENE-4956.patch
>
>
> Korean language has specific characteristic. When developing search service 
> with lucene & solr in korean, there are some problems in searching and 
> indexing. The korean analyer solved the problems with a korean morphological 
> anlyzer. It consists of a korean morphological analyzer, dictionaries, a 
> korean tokenizer and a korean filter. The korean anlyzer is made for lucene 
> and solr. If you develop a search service with lucene in korean, It is the 
> best idea to choose the korean analyzer.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries

2013-10-08 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13789944#comment-13789944
 ] 

Christian Moen commented on LUCENE-4956:


Thanks a lot.

> the korean analyzer that has a korean morphological analyzer and dictionaries
> -
>
> Key: LUCENE-4956
> URL: https://issues.apache.org/jira/browse/LUCENE-4956
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Affects Versions: 4.2
>Reporter: SooMyung Lee
>Assignee: Christian Moen
>  Labels: newbie
> Attachments: kr.analyzer.4x.tar, lucene-4956.patch, lucene4956.patch, 
> LUCENE-4956.patch
>
>
> Korean language has specific characteristic. When developing search service 
> with lucene & solr in korean, there are some problems in searching and 
> indexing. The korean analyer solved the problems with a korean morphological 
> anlyzer. It consists of a korean morphological analyzer, dictionaries, a 
> korean tokenizer and a korean filter. The korean anlyzer is made for lucene 
> and solr. If you develop a search service with lucene in korean, It is the 
> best idea to choose the korean analyzer.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries

2013-10-08 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13789052#comment-13789052
 ] 

Christian Moen commented on LUCENE-4956:


SooMyung,

The patch you uploaded on September 11th, was that made against the latest 
{{lucene4956}} branch?

The patch doesn't apply cleanly against {{lucene4956}} for me.  Could you clarify 
its origin and how it should be applied?  If you can make a patch against the code 
on {{lucene4956}}, that would be much appreciated.

Thanks!

> the korean analyzer that has a korean morphological analyzer and dictionaries
> -
>
> Key: LUCENE-4956
> URL: https://issues.apache.org/jira/browse/LUCENE-4956
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Affects Versions: 4.2
>Reporter: SooMyung Lee
>Assignee: Christian Moen
>  Labels: newbie
> Attachments: kr.analyzer.4x.tar, lucene-4956.patch, lucene4956.patch, 
> LUCENE-4956.patch
>
>
> Korean language has specific characteristic. When developing search service 
> with lucene & solr in korean, there are some problems in searching and 
> indexing. The korean analyer solved the problems with a korean morphological 
> anlyzer. It consists of a korean morphological analyzer, dictionaries, a 
> korean tokenizer and a korean filter. The korean anlyzer is made for lucene 
> and solr. If you develop a search service with lucene in korean, It is the 
> best idea to choose the korean analyzer.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries

2013-10-04 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13786905#comment-13786905
 ] 

Christian Moen commented on LUCENE-4956:


Thanks for pushing me on this.  I'll have a look at your recent changes and 
commit to trunk shortly if everything seems fine.  I hope to have this 
committed to trunk early next week.  Sorry this has dragged out.

> the korean analyzer that has a korean morphological analyzer and dictionaries
> -
>
> Key: LUCENE-4956
> URL: https://issues.apache.org/jira/browse/LUCENE-4956
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Affects Versions: 4.2
>Reporter: SooMyung Lee
>Assignee: Christian Moen
>  Labels: newbie
> Attachments: kr.analyzer.4x.tar, lucene-4956.patch, lucene4956.patch, 
> LUCENE-4956.patch
>
>
> Korean language has specific characteristic. When developing search service 
> with lucene & solr in korean, there are some problems in searching and 
> indexing. The korean analyer solved the problems with a korean morphological 
> anlyzer. It consists of a korean morphological analyzer, dictionaries, a 
> korean tokenizer and a korean filter. The korean anlyzer is made for lucene 
> and solr. If you develop a search service with lucene in korean, It is the 
> best idea to choose the korean analyzer.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-5244) NPE in Japanese Analyzer

2013-09-25 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13777560#comment-13777560
 ] 

Christian Moen commented on LUCENE-5244:


Hello Benson,

In your code on GitHub, try calling {{tokenStream.reset()}} before consuming the 
stream.
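
For reference, the usual consumption pattern is roughly the following (a sketch 
only; the field name and input text are made up, and the no-argument 
{{JapaneseAnalyzer}} constructor assumes a recent Lucene version):

{code:java}
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ja.JapaneseAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class ConsumeTokenStream {
  public static void main(String[] args) throws IOException {
    Analyzer analyzer = new JapaneseAnalyzer();
    try (TokenStream stream = analyzer.tokenStream("body", "検索エンジンのテスト")) {
      CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
      stream.reset();                       // required before the first incrementToken()
      while (stream.incrementToken()) {
        System.out.println(term.toString());
      }
      stream.end();                         // after the last token has been consumed
    }                                       // close() happens via try-with-resources
  }
}
{code}

Skipping {{reset()}} before the first {{incrementToken()}} is what leads to NPEs 
like the one above.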

> NPE in Japanese Analyzer
> 
>
> Key: LUCENE-5244
> URL: https://issues.apache.org/jira/browse/LUCENE-5244
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Affects Versions: 4.4
>Reporter: Benson Margulies
>
> I've got a test case that shows an NPE with the Japanese analyzer.
> It's all available in https://github.com/benson-basis/kuromoji-npe, and I 
> explicitly grant a license to the Foundation.
> If anyone would prefer that I attach a tarball here, just let me know.
> {noformat}
> ---
>  T E S T S
> ---
> Running com.basistech.testcase.JapaneseNpeTest
> Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.298 sec <<< 
> FAILURE! - in com.basistech.testcase.JapaneseNpeTest
> japaneseNpe(com.basistech.testcase.JapaneseNpeTest)  Time elapsed: 0.282 sec  
> <<< ERROR!
> java.lang.NullPointerException: null
>   at 
> org.apache.lucene.analysis.util.RollingCharBuffer.get(RollingCharBuffer.java:86)
>   at 
> org.apache.lucene.analysis.ja.JapaneseTokenizer.parse(JapaneseTokenizer.java:618)
>   at 
> org.apache.lucene.analysis.ja.JapaneseTokenizer.incrementToken(JapaneseTokenizer.java:468)
>   at 
> com.basistech.testcase.JapaneseNpeTest.japaneseNpe(JapaneseNpeTest.java:28)
> {noformat}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries

2013-08-14 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13739392#comment-13739392
 ] 

Christian Moen commented on LUCENE-4956:


SooMyung, let's sync up regarding your latest changes (the patch you attached). 
 I'm thinking perhaps we can merge to {{trunk}} first and iterate from there.  
Thanks.

> the korean analyzer that has a korean morphological analyzer and dictionaries
> -
>
> Key: LUCENE-4956
> URL: https://issues.apache.org/jira/browse/LUCENE-4956
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Affects Versions: 4.2
>Reporter: SooMyung Lee
>Assignee: Christian Moen
>  Labels: newbie
> Attachments: kr.analyzer.4x.tar, lucene4956.patch, LUCENE-4956.patch
>
>
> Korean language has specific characteristic. When developing search service 
> with lucene & solr in korean, there are some problems in searching and 
> indexing. The korean analyer solved the problems with a korean morphological 
> anlyzer. It consists of a korean morphological analyzer, dictionaries, a 
> korean tokenizer and a korean filter. The korean anlyzer is made for lucene 
> and solr. If you develop a search service with lucene in korean, It is the 
> best idea to choose the korean analyzer.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries

2013-08-14 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13739387#comment-13739387
 ] 

Christian Moen commented on LUCENE-4956:


Attaching a patch against {{trunk}} (r1513348).


> the korean analyzer that has a korean morphological analyzer and dictionaries
> -
>
> Key: LUCENE-4956
> URL: https://issues.apache.org/jira/browse/LUCENE-4956
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Affects Versions: 4.2
>Reporter: SooMyung Lee
>Assignee: Christian Moen
>  Labels: newbie
> Attachments: kr.analyzer.4x.tar, lucene4956.patch, LUCENE-4956.patch
>
>
> Korean language has specific characteristic. When developing search service 
> with lucene & solr in korean, there are some problems in searching and 
> indexing. The korean analyer solved the problems with a korean morphological 
> anlyzer. It consists of a korean morphological analyzer, dictionaries, a 
> korean tokenizer and a korean filter. The korean anlyzer is made for lucene 
> and solr. If you develop a search service with lucene in korean, It is the 
> best idea to choose the korean analyzer.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries

2013-08-14 Thread Christian Moen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian Moen updated LUCENE-4956:
---

Attachment: LUCENE-4956.patch

> the korean analyzer that has a korean morphological analyzer and dictionaries
> -
>
> Key: LUCENE-4956
> URL: https://issues.apache.org/jira/browse/LUCENE-4956
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Affects Versions: 4.2
>Reporter: SooMyung Lee
>Assignee: Christian Moen
>  Labels: newbie
> Attachments: kr.analyzer.4x.tar, lucene4956.patch, LUCENE-4956.patch
>
>
> Korean language has specific characteristic. When developing search service 
> with lucene & solr in korean, there are some problems in searching and 
> indexing. The korean analyer solved the problems with a korean morphological 
> anlyzer. It consists of a korean morphological analyzer, dictionaries, a 
> korean tokenizer and a korean filter. The korean anlyzer is made for lucene 
> and solr. If you develop a search service with lucene in korean, It is the 
> best idea to choose the korean analyzer.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries

2013-08-13 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13739301#comment-13739301
 ] 

Christian Moen commented on LUCENE-4956:


I've now aligned the branch with {{trunk}} and updated the example {{schema.xml}} 
to use {{text_ko}} naming for the Korean field type.

I've also indexed Korean Wikipedia continuously for a few hours and the JVM 
heap looks fine.

There are several additional things that can be done with this code, including 
generating the parser using JFlex at build time, fixing some of the position 
issues found by random-blasting, cleanups and dead-code removal, etc.  That said, 
I believe the code we have is useful to Korean users as-is, and I think it's a 
good idea to integrate it into {{trunk}} and iterate further from there.

Please share your thoughts.  Thanks.


> the korean analyzer that has a korean morphological analyzer and dictionaries
> -
>
> Key: LUCENE-4956
> URL: https://issues.apache.org/jira/browse/LUCENE-4956
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Affects Versions: 4.2
>Reporter: SooMyung Lee
>Assignee: Christian Moen
>  Labels: newbie
> Attachments: kr.analyzer.4x.tar, lucene4956.patch
>
>
> Korean language has specific characteristic. When developing search service 
> with lucene & solr in korean, there are some problems in searching and 
> indexing. The korean analyer solved the problems with a korean morphological 
> anlyzer. It consists of a korean morphological analyzer, dictionaries, a 
> korean tokenizer and a korean filter. The korean anlyzer is made for lucene 
> and solr. If you develop a search service with lucene in korean, It is the 
> best idea to choose the korean analyzer.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries

2013-07-09 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13704147#comment-13704147
 ] 

Christian Moen commented on LUCENE-4956:


Hello SooMyung,

I'm the one who hasn't followed up properly on this, as I've been too bogged 
down with other things.  I've set aside time next week to work on this, and I 
hope to have Korean merged and integrated with {{trunk}} then.  I'm not sure we 
can make 4.4, but I'm willing to put in extra effort if there's a chance we can 
get it in in time.

> the korean analyzer that has a korean morphological analyzer and dictionaries
> -
>
> Key: LUCENE-4956
> URL: https://issues.apache.org/jira/browse/LUCENE-4956
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Affects Versions: 4.2
>Reporter: SooMyung Lee
>Assignee: Christian Moen
>  Labels: newbie
> Attachments: kr.analyzer.4x.tar, lucene4956.patch
>
>
> Korean language has specific characteristic. When developing search service 
> with lucene & solr in korean, there are some problems in searching and 
> indexing. The korean analyer solved the problems with a korean morphological 
> anlyzer. It consists of a korean morphological analyzer, dictionaries, a 
> korean tokenizer and a korean filter. The korean anlyzer is made for lucene 
> and solr. If you develop a search service with lucene in korean, It is the 
> best idea to choose the korean analyzer.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4945) Japanese Autocomplete and Highlighter broken

2013-06-26 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13694148#comment-13694148
 ] 

Christian Moen commented on SOLR-4945:
--

No, it's not. {{JapaneseTokenizerFactory}} is available in 3.6 or newer.  
Kindly upgrade to the latest version of Solr (currently 4.3.1) and see if the 
problem persists.  If it does, please indicate how you reproduced it in detail 
so we can start investigating the cause.  Thanks.

> Japanese Autocomplete and Highlighter broken
> 
>
> Key: SOLR-4945
> URL: https://issues.apache.org/jira/browse/SOLR-4945
> Project: Solr
>  Issue Type: Bug
>  Components: highlighter
>Reporter: Shruthi Khatawkar
>
> Autocomplete is implemented with Highlighter functionality. This works fine 
> for most of the languages but breaks for Japanese.
> multivalued,termVector,termPositions and termOffset are set to true.
> Here is an example:
> Query: product classic.
> Result:
> Actual : 
> この商品の互換性の機種にproduct 1 やclassic Touch2 が記載が有りません。 USB接続ケーブルをproduct 1 やclassic 
> Touch2に付属の物を使えば利用出来ると思いますが 間違っていますか?
> With Highlighter (  tags being used):
> この商品の互換性の機種にproduct 1 やclassic Touch2 が記載が有りません。 
> USB接続ケーブルをproduct 1 やclassic Touch2に付属の物を使えば利用出来ると思いますが 間違っていますか?
> Though query terms "product classic" is repeated twice, highlighting is 
> happening only on the first instance. As shown above.
> Solr returns only first instance offset and second instance is ignored.
> Also it's observed, highlighter repeats first letter of the token if there is 
> numeric.
> For eg.Query : product and We have product1, highlighter returns as 
> pproduct1.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4945) Japanese Autocomplete and Highlighter broken

2013-06-26 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13693894#comment-13693894
 ] 

Christian Moen commented on SOLR-4945:
--

Hello Shruthi,

Could you confirm if you see this problem when using 
{{JapaneseTokenizerFactory}}?  

{{SenTokenizerFactory}} isn't part of Solr and if you are seeing funny offsets 
there, that could be the root cause of this.  This is my speculation only -- I 
really don't know...

I believe {{JapaneseTokenizerFactory}} in normal mode gives a similar 
segmentation to {{SenTokenizer}}, and it would be good to see whether we can 
reproduce this using {{JapaneseTokenizerFactory}}.

Many thanks.

> Japanese Autocomplete and Highlighter broken
> 
>
> Key: SOLR-4945
> URL: https://issues.apache.org/jira/browse/SOLR-4945
> Project: Solr
>  Issue Type: Bug
>  Components: highlighter
>Reporter: Shruthi Khatawkar
>
> Autocomplete is implemented with Highlighter functionality. This works fine 
> for most of the languages but breaks for Japanese.
> multivalued,termVector,termPositions and termOffset are set to true.
> Here is an example:
> Query: product classic.
> Result:
> Actual : 
> この商品の互換性の機種にproduct 1 やclassic Touch2 が記載が有りません。 USB接続ケーブルをproduct 1 やclassic 
> Touch2に付属の物を使えば利用出来ると思いますが 間違っていますか?
> With Highlighter (  tags being used):
> この商品の互換性の機種にproduct 1 やclassic Touch2 が記載が有りません。 
> USB接続ケーブルをproduct 1 やclassic Touch2に付属の物を使えば利用出来ると思いますが 間違っていますか?
> Though query terms "product classic" is repeated twice, highlighting is 
> happening only on the first instance. As shown above.
> Solr returns only first instance offset and second instance is ignored.
> Also it's observed, highlighter repeats first letter of the token if there is 
> numeric.
> For eg.Query : product and We have product1, highlighter returns as 
> pproduct1.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4945) Japanese Autocomplete and Highlighter broken

2013-06-21 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13690131#comment-13690131
 ] 

Christian Moen commented on SOLR-4945:
--

Hello Shruthi,

Does this have anything to do with autocomplete or is this solely a 
highlighting issue?  Which field type are you using?  Are you using 
JapaneseTokenizer as part of this field type with search mode turned on?  
Thanks.


> Japanese Autocomplete and Highlighter broken
> 
>
> Key: SOLR-4945
> URL: https://issues.apache.org/jira/browse/SOLR-4945
> Project: Solr
>  Issue Type: Bug
>  Components: highlighter
>Reporter: Shruthi Khatawkar
>
> Autocomplete is implemented with Highlighter functionality. This works fine 
> for most of the languages but breaks for Japanese.
> multivalued,termVector,termPositions and termOffset are set to true.
> Here is an example:
> Query: product classic.
> Result:
> Actual : 
> この商品の互換性の機種にproduct 1 やclassic Touch2 が記載が有りません。 USB接続ケーブルをproduct 1 やclassic 
> Touch2に付属の物を使えば利用出来ると思いますが 間違っていますか?
> With Highlighter (  tags being used):
> この商品の互換性の機種にproduct 1 やclassic Touch2 が記載が有りません。 
> USB接続ケーブルをproduct 1 やclassic Touch2に付属の物を使えば利用出来ると思いますが 間違っていますか?
> Though query terms "product classic" is repeated twice, highlighting is 
> happening only on the first instance. As shown above.
> Solr returns only first instance offset and second instance is ignored.
> Also it's observed, highlighter repeats first letter of the token if there is 
> numeric.
> For eg.Query : product and We have product1, highlighter returns as 
> pproduct1.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-5013) ScandinavianInterintelligableASCIIFoldingFilter

2013-05-22 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13664383#comment-13664383
 ] 

Christian Moen commented on LUCENE-5013:


bq. Though maybe the name could be simpler, (ScandinavianNormalizationFilter?)

+1

> ScandinavianInterintelligableASCIIFoldingFilter
> ---
>
> Key: LUCENE-5013
> URL: https://issues.apache.org/jira/browse/LUCENE-5013
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Affects Versions: 4.3
>Reporter: Karl Wettin
>Priority: Trivial
> Attachments: LUCENE-5013.txt
>
>
> This filter is an augmentation of output from ASCIIFoldingFilter,
> it discriminate against double vowels aa, ae, ao, oe and oo, leaving just the 
> first one.
> blåbærsyltetøj == blåbärsyltetöj == blaabaarsyltetoej == blabarsyltetoj
> räksmörgås == ræksmørgås == ræksmörgaos == raeksmoergaas == raksmorgas
> Caveats:
> Since this is a filtering on top of ASCIIFoldingFilter äöåøæ already has been 
> folded down to aoaoae when handled by this filter it will cause effects such 
> as:
> bøen -> boen -> bon
> åene -> aene -> ane
> I find this to be a trivial problem compared to not finding anything at all.
> Background:
> Swedish åäö is in fact the same letters as Norwegian and Danish åæø and thus 
> interchangeable in when used between these languages. They are however folded 
> differently when people type them on a keyboard lacking these characters and 
> ASCIIFoldingFilter handle ä and æ differently.
> When a Swedish person is lacking umlauted characters on the keyboard they 
> consistently type a, a, o instead of å, ä, ö. Foreigners also tend to use a, 
> a, o.
> In Norway people tend to type aa, ae and oe instead of å, æ and ø. Some use 
> a, a, o. I've also seen oo, ao, etc. And permutations. Not sure about Denmark 
> but the pattern is probably the same.
> This filter solves that problem, but might also cause new.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries

2013-05-22 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13664228#comment-13664228
 ] 

Christian Moen commented on LUCENE-4956:


Thanks a lot!

> the korean analyzer that has a korean morphological analyzer and dictionaries
> -
>
> Key: LUCENE-4956
> URL: https://issues.apache.org/jira/browse/LUCENE-4956
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Affects Versions: 4.2
>Reporter: SooMyung Lee
>Assignee: Christian Moen
>  Labels: newbie
> Attachments: kr.analyzer.4x.tar, lucene4956.patch
>
>
> Korean language has specific characteristic. When developing search service 
> with lucene & solr in korean, there are some problems in searching and 
> indexing. The korean analyer solved the problems with a korean morphological 
> anlyzer. It consists of a korean morphological analyzer, dictionaries, a 
> korean tokenizer and a korean filter. The korean anlyzer is made for lucene 
> and solr. If you develop a search service with lucene in korean, It is the 
> best idea to choose the korean analyzer.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries

2013-05-22 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13664223#comment-13664223
 ] 

Christian Moen commented on LUCENE-4956:


I'm happy to take care of this unless you want to do it, Steve.  I can do this 
either tomorrow or on Friday.  Thanks.

> the korean analyzer that has a korean morphological analyzer and dictionaries
> -
>
> Key: LUCENE-4956
> URL: https://issues.apache.org/jira/browse/LUCENE-4956
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Affects Versions: 4.2
>Reporter: SooMyung Lee
>Assignee: Christian Moen
>  Labels: newbie
> Attachments: kr.analyzer.4x.tar, lucene4956.patch
>
>
> Korean language has specific characteristic. When developing search service 
> with lucene & solr in korean, there are some problems in searching and 
> indexing. The korean analyer solved the problems with a korean morphological 
> anlyzer. It consists of a korean morphological analyzer, dictionaries, a 
> korean tokenizer and a korean filter. The korean anlyzer is made for lucene 
> and solr. If you develop a search service with lucene in korean, It is the 
> best idea to choose the korean analyzer.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries

2013-05-17 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13660528#comment-13660528
 ] 

Christian Moen commented on LUCENE-4956:


I've run {{KoreanAnalyzer}} on Korean Wikipedia and also had a look at 
memory/heap usage.  Things look okay overall.

I believe {{KoreanFilter}} produces wrong offsets for synonym tokens; this was 
discovered by random-blasting.  Looking into the issue...
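
For context, the random-blasting referred to is along these lines (a sketch; the 
analyzer class and package are those on the {{lucene4956}} branch, and the 
iteration count is arbitrary):

{code:java}
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.BaseTokenStreamTestCase;
import org.apache.lucene.analysis.kr.KoreanAnalyzer; // package as on the lucene4956 branch

public class TestKoreanRandomBlasting extends BaseTokenStreamTestCase {

  public void testRandomStrings() throws Exception {
    Analyzer analyzer = new KoreanAnalyzer();
    // Feeds random text through the whole chain; broken offsets or positions
    // (such as the synonym-token offsets mentioned above) fail the assertions.
    checkRandomData(random(), analyzer, 1000 * RANDOM_MULTIPLIER);
  }
}
{code}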

> the korean analyzer that has a korean morphological analyzer and dictionaries
> -
>
> Key: LUCENE-4956
> URL: https://issues.apache.org/jira/browse/LUCENE-4956
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Affects Versions: 4.2
>Reporter: SooMyung Lee
>Assignee: Christian Moen
>  Labels: newbie
> Attachments: kr.analyzer.4x.tar
>
>
> Korean language has specific characteristic. When developing search service 
> with lucene & solr in korean, there are some problems in searching and 
> indexing. The korean analyer solved the problems with a korean morphological 
> anlyzer. It consists of a korean morphological analyzer, dictionaries, a 
> korean tokenizer and a korean filter. The korean anlyzer is made for lucene 
> and solr. If you develop a search service with lucene in korean, It is the 
> best idea to choose the korean analyzer.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4813) Unavoidable IllegalArgumentException occurs when SynonymFilterFactory's setting has tokenizer factory's parameter.

2013-05-14 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13657937#comment-13657937
 ] 

Christian Moen commented on SOLR-4813:
--

Good work. Thanks!

> Unavoidable IllegalArgumentException occurs when SynonymFilterFactory's 
> setting has tokenizer factory's parameter.
> --
>
> Key: SOLR-4813
> URL: https://issues.apache.org/jira/browse/SOLR-4813
> Project: Solr
>  Issue Type: Bug
>  Components: Schema and Analysis
>Affects Versions: 4.3
>Reporter: Shingo Sasaki
>Assignee: Hoss Man
>Priority: Critical
>  Labels: SynonymFilterFactory
> Fix For: 5.0, 4.4, 4.3.1
>
> Attachments: SOLR-4813__4x.patch, SOLR-4813.patch, SOLR-4813.patch
>
>
> When I write SynonymFilterFactory' setting in schema.xml as follows, ...
> {code:xml}
> 
>minGramSize="2"/>
>ignoreCase="true" expand="true"
>tokenizerFactory="solr.NGramTokenizerFactory" maxGramSize="2" 
> minGramSize="2"/>
> 
> {code}
> IllegalArgumentException ("Unknown parameters") occurs.
> {noformat}
> Caused by: java.lang.IllegalArgumentException: Unknown parameters: 
> {maxGramSize=2, minGramSize=2}
>   at 
> org.apache.lucene.analysis.synonym.FSTSynonymFilterFactory.(FSTSynonymFilterFactory.java:71)
>   at 
> org.apache.lucene.analysis.synonym.SynonymFilterFactory.(SynonymFilterFactory.java:50)
>   ... 28 more
> {noformat}
> However TokenizerFactory's params should be set to loadTokenizerFactory 
> method in [FST|Slow]SynonymFilterFactory. (ref. SOLR-2909)
> I think, the problem was caused by LUCENE-4877 ("Fix analyzer factories to 
> throw exception when arguments are invalid") and SOLR-3402 ("Parse Version 
> outside of Analysis Factories").

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries

2013-05-14 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13656919#comment-13656919
 ] 

Christian Moen commented on LUCENE-4956:


Hello SooMyung,

Thanks for the above regarding the field type.  The general approach we have 
taken in Lucene is to do the same analysis on both the index and query side.  For 
example, the Japanese analyzer also has functionality to do compound splitting, 
and we've discussed doing this on the index side only by default for field 
type {{text_ja}}, but we decided against it.

I've included your field type in the latest code I've checked in just now, but 
it's likely that we will change this in the future.

I'm wondering if you could help me with a few sample sentences that illustrate 
the various options {{KoreanFilter}} has.  I'd like to add some test-cases for 
these to better understand the differences between them and to verify correct 
behaviour.  Test-cases are also a useful way to document functionality in 
general.  Thanks for any help with this!
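
Here's a skeleton of the kind of test-case I have in mind (the sample sentence 
and expected tokens are placeholders to be filled in with real examples):

{code:java}
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.BaseTokenStreamTestCase;
import org.apache.lucene.analysis.kr.KoreanAnalyzer; // package as on the lucene4956 branch

public class TestKoreanFilterOptions extends BaseTokenStreamTestCase {

  public void testDecompoundingOption() throws Exception {
    // Placeholder input/output; a real sample sentence and its expected tokens
    // would document what the option actually does.
    Analyzer analyzer = new KoreanAnalyzer();
    assertAnalyzesTo(analyzer, "<sample sentence>",
        new String[] { "<expected>", "<tokens>" });
  }
}
{code}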

> the korean analyzer that has a korean morphological analyzer and dictionaries
> -
>
> Key: LUCENE-4956
> URL: https://issues.apache.org/jira/browse/LUCENE-4956
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Affects Versions: 4.2
>Reporter: SooMyung Lee
>Assignee: Christian Moen
>  Labels: newbie
> Attachments: kr.analyzer.4x.tar
>
>
> Korean language has specific characteristic. When developing search service 
> with lucene & solr in korean, there are some problems in searching and 
> indexing. The korean analyer solved the problems with a korean morphological 
> anlyzer. It consists of a korean morphological analyzer, dictionaries, a 
> korean tokenizer and a korean filter. The korean anlyzer is made for lucene 
> and solr. If you develop a search service with lucene in korean, It is the 
> best idea to choose the korean analyzer.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries

2013-05-14 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13656901#comment-13656901
 ] 

Christian Moen commented on LUCENE-4956:


Thanks, Steve & co.!

> the korean analyzer that has a korean morphological analyzer and dictionaries
> -
>
> Key: LUCENE-4956
> URL: https://issues.apache.org/jira/browse/LUCENE-4956
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Affects Versions: 4.2
>Reporter: SooMyung Lee
>Assignee: Christian Moen
>  Labels: newbie
> Attachments: kr.analyzer.4x.tar
>
>
> Korean language has specific characteristic. When developing search service 
> with lucene & solr in korean, there are some problems in searching and 
> indexing. The korean analyer solved the problems with a korean morphological 
> anlyzer. It consists of a korean morphological analyzer, dictionaries, a 
> korean tokenizer and a korean filter. The korean anlyzer is made for lucene 
> and solr. If you develop a search service with lucene in korean, It is the 
> best idea to choose the korean analyzer.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries

2013-05-08 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13651826#comment-13651826
 ] 

Christian Moen commented on LUCENE-4956:


Updates:

* Added {{text_kr}} field type to {{schema.xml}}
* Fixed Solr factories to load field type {{text_kr}} in the example
* Updated javadoc so that it compiles cleanly (mostly removed illegal javadoc)
* Updated various build-related things to include Korean in the Solr 
distribution
* Added placeholder stopwords file
* Added services for arirang

Korean analysis using field type {{text_kr}} seems to be doing the right thing 
out of the box now, but some configuration options in the factories aren't 
working yet.  There are several other things that need polishing up, but 
we're making progress.

> the korean analyzer that has a korean morphological analyzer and dictionaries
> -
>
> Key: LUCENE-4956
> URL: https://issues.apache.org/jira/browse/LUCENE-4956
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Affects Versions: 4.2
>Reporter: SooMyung Lee
>Assignee: Christian Moen
>  Labels: newbie
> Attachments: kr.analyzer.4x.tar
>
>
> Korean language has specific characteristic. When developing search service 
> with lucene & solr in korean, there are some problems in searching and 
> indexing. The korean analyer solved the problems with a korean morphological 
> anlyzer. It consists of a korean morphological analyzer, dictionaries, a 
> korean tokenizer and a korean filter. The korean anlyzer is made for lucene 
> and solr. If you develop a search service with lucene in korean, It is the 
> best idea to choose the korean analyzer.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries

2013-05-06 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13649833#comment-13649833
 ] 

Christian Moen commented on LUCENE-4956:


bq. I think we're ready for the incubator-general vote. [~cm], do you agree?

+1 

> the korean analyzer that has a korean morphological analyzer and dictionaries
> -
>
> Key: LUCENE-4956
> URL: https://issues.apache.org/jira/browse/LUCENE-4956
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Affects Versions: 4.2
>Reporter: SooMyung Lee
>Assignee: Christian Moen
>  Labels: newbie
> Attachments: kr.analyzer.4x.tar
>
>
> Korean language has specific characteristic. When developing search service 
> with lucene & solr in korean, there are some problems in searching and 
> indexing. The korean analyer solved the problems with a korean morphological 
> anlyzer. It consists of a korean morphological analyzer, dictionaries, a 
> korean tokenizer and a korean filter. The korean anlyzer is made for lucene 
> and solr. If you develop a search service with lucene in korean, It is the 
> best idea to choose the korean analyzer.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries

2013-05-05 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13649379#comment-13649379
 ] 

Christian Moen commented on LUCENE-4956:


Good points, Uwe.  I'll look into this.

> the korean analyzer that has a korean morphological analyzer and dictionaries
> -
>
> Key: LUCENE-4956
> URL: https://issues.apache.org/jira/browse/LUCENE-4956
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Affects Versions: 4.2
>Reporter: SooMyung Lee
>Assignee: Christian Moen
>  Labels: newbie
> Attachments: kr.analyzer.4x.tar
>
>
> Korean language has specific characteristic. When developing search service 
> with lucene & solr in korean, there are some problems in searching and 
> indexing. The korean analyer solved the problems with a korean morphological 
> anlyzer. It consists of a korean morphological analyzer, dictionaries, a 
> korean tokenizer and a korean filter. The korean anlyzer is made for lucene 
> and solr. If you develop a search service with lucene in korean, It is the 
> best idea to choose the korean analyzer.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries

2013-05-05 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13649296#comment-13649296
 ] 

Christian Moen commented on LUCENE-4956:


Thanks, Steve.  I've added the missing license header to 
{{TestKoreanAnalyzer.java}}.

> the korean analyzer that has a korean morphological analyzer and dictionaries
> -
>
> Key: LUCENE-4956
> URL: https://issues.apache.org/jira/browse/LUCENE-4956
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Affects Versions: 4.2
>Reporter: SooMyung Lee
>Assignee: Christian Moen
>  Labels: newbie
> Attachments: kr.analyzer.4x.tar
>
>
> Korean language has specific characteristic. When developing search service 
> with lucene & solr in korean, there are some problems in searching and 
> indexing. The korean analyer solved the problems with a korean morphological 
> anlyzer. It consists of a korean morphological analyzer, dictionaries, a 
> korean tokenizer and a korean filter. The korean anlyzer is made for lucene 
> and solr. If you develop a search service with lucene in korean, It is the 
> best idea to choose the korean analyzer.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries

2013-05-04 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13649260#comment-13649260
 ] 

Christian Moen commented on LUCENE-4956:


I've created branch {{lucene4956}} and checked in an {{arirang}} module in 
{{lucene/analysis}}.  I've added a basic test that tests segmentation, offsets, 
etc.

Other updates:

* Some compilation warnings related to generics have been fixed, but several 
remain.
* License headers have been added to all source code files
* Author tags have been removed from all files, except {{StringUtils}} pending 
SooMyoung's feedback (see above)
* Added IntelliJ IDEA config to make {{ant idea}} set things up correctly.  
Eclipse is TODO.

My next step is to fix the compilation-related warnings altogether, and once 
we've confirmed {{StringUtils}}, I think we can do the incubator-general vote.  
I'll keep you posted.

I think we should also consider rewriting and optimising some of the code here 
and there, but that's for later.  It's great if you can be involved in this 
process, SooMyoung!  I'll probably need your help and good advice here and 
there. :)

> the korean analyzer that has a korean morphological analyzer and dictionaries
> -
>
> Key: LUCENE-4956
> URL: https://issues.apache.org/jira/browse/LUCENE-4956
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Affects Versions: 4.2
>Reporter: SooMyung Lee
>Assignee: Christian Moen
>  Labels: newbie
> Attachments: kr.analyzer.4x.tar
>
>
> The Korean language has specific characteristics. When developing a search 
> service with Lucene & Solr in Korean, there are some problems in searching 
> and indexing. The Korean analyzer solves these problems with a Korean 
> morphological analyzer. It consists of a Korean morphological analyzer, 
> dictionaries, a Korean tokenizer and a Korean filter. The Korean analyzer is 
> made for Lucene and Solr. If you develop a search service with Lucene in 
> Korean, the Korean analyzer is the best choice.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries

2013-05-04 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13649258#comment-13649258
 ] 

Christian Moen commented on LUCENE-4956:


Hello SooMyoung,

Could you comment about the origins and authorship of 
{{org.apache.lucene.analysis.kr.utils.StringUtil}} in your tar file?

I'm seeing a lot of authors in this file. Is this from Apache Commons Lang?  
Thanks!

> the korean analyzer that has a korean morphological analyzer and dictionaries
> -
>
> Key: LUCENE-4956
> URL: https://issues.apache.org/jira/browse/LUCENE-4956
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Affects Versions: 4.2
>Reporter: SooMyung Lee
>Assignee: Christian Moen
>  Labels: newbie
> Attachments: kr.analyzer.4x.tar
>
>
> The Korean language has specific characteristics. When developing a search 
> service with Lucene & Solr in Korean, there are some problems in searching 
> and indexing. The Korean analyzer solves these problems with a Korean 
> morphological analyzer. It consists of a Korean morphological analyzer, 
> dictionaries, a Korean tokenizer and a Korean filter. The Korean analyzer is 
> made for Lucene and Solr. If you develop a search service with Lucene in 
> Korean, the Korean analyzer is the best choice.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Assigned] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries

2013-05-04 Thread Christian Moen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian Moen reassigned LUCENE-4956:
--

Assignee: Christian Moen

> the korean analyzer that has a korean morphological analyzer and dictionaries
> -
>
> Key: LUCENE-4956
> URL: https://issues.apache.org/jira/browse/LUCENE-4956
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Affects Versions: 4.2
>Reporter: SooMyung Lee
>Assignee: Christian Moen
>  Labels: newbie
> Attachments: kr.analyzer.4x.tar
>
>
> The Korean language has specific characteristics. When developing a search 
> service with Lucene & Solr in Korean, there are some problems in searching 
> and indexing. The Korean analyzer solves these problems with a Korean 
> morphological analyzer. It consists of a Korean morphological analyzer, 
> dictionaries, a Korean tokenizer and a Korean filter. The Korean analyzer is 
> made for Lucene and Solr. If you develop a search service with Lucene in 
> Korean, the Korean analyzer is the best choice.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries

2013-05-04 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13649125#comment-13649125
 ] 

Christian Moen commented on LUCENE-4956:


A quick status update on my side is as follows:

I've put the code into a module called {{arirang}} on my local setup and 
made a few changes necessary to make things work on {{trunk}}. 
{{KoreanAnalyzer}} now produces Korean tokens and some tests I've made pass 
when run from my IDE.

Loading the dictionaries as resources needs some work and I'll spend time on 
this during the weekend.  I'll also address the headers, etc. to prepare for 
the incubator-general vote.

Hopefully, I'll have all this on a branch this weekend.  I'll keep you posted 
and we can take things from there.

> the korean analyzer that has a korean morphological analyzer and dictionaries
> -
>
> Key: LUCENE-4956
> URL: https://issues.apache.org/jira/browse/LUCENE-4956
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Affects Versions: 4.2
>Reporter: SooMyung Lee
>  Labels: newbie
> Attachments: kr.analyzer.4x.tar
>
>
> The Korean language has specific characteristics. When developing a search 
> service with Lucene & Solr in Korean, there are some problems in searching 
> and indexing. The Korean analyzer solves these problems with a Korean 
> morphological analyzer. It consists of a Korean morphological analyzer, 
> dictionaries, a Korean tokenizer and a Korean filter. The Korean analyzer is 
> made for Lucene and Solr. If you develop a search service with Lucene in 
> Korean, the Korean analyzer is the best choice.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries

2013-05-01 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13646960#comment-13646960
 ] 

Christian Moen commented on LUCENE-4956:


SooMyung, I don't think you need to do anything at this point.  I think a good 
next step is that we create a new branch and check the code you have submitted 
onto that branch.  We can then start looking into addressing the headers and 
other items that people have pointed out in comments.  (Thanks, Jack and 
Edward!)

Steve, will there be a vote after the code has been checked onto the branch?  
If you think the above is a good next step, I'm happy to start working on this 
either later this week or next week.  Kindly let me know how you prefer to 
proceed.  Thanks.

> the korean analyzer that has a korean morphological analyzer and dictionaries
> -
>
> Key: LUCENE-4956
> URL: https://issues.apache.org/jira/browse/LUCENE-4956
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Affects Versions: 4.2
>Reporter: SooMyung Lee
>  Labels: newbie
> Attachments: kr.analyzer.4x.tar
>
>
> The Korean language has specific characteristics. When developing a search 
> service with Lucene & Solr in Korean, there are some problems in searching 
> and indexing. The Korean analyzer solves these problems with a Korean 
> morphological analyzer. It consists of a Korean morphological analyzer, 
> dictionaries, a Korean tokenizer and a Korean filter. The Korean analyzer is 
> made for Lucene and Solr. If you develop a search service with Lucene in 
> Korean, the Korean analyzer is the best choice.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries

2013-04-27 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13643901#comment-13643901
 ] 

Christian Moen commented on LUCENE-4956:


The Korean analyzer should be named 
{{org.apache.lucene.analysis.kr.KoreanAnalyzer}} and we'll provide a 
ready-to-use field type {{text_kr}} in {{schema.xml}} for Solr users, which is 
consistent with what we do for other languages.

As for where the analyzer code itself lives, I think it's fine to put it in 
{{lucene/analysis/arirang}}.  The file {{lucene/analysis/README.txt}} documents 
what these modules are and the code is easily and directly retrievable in IDEs 
by looking up {{KoreanAnalyzer}} (the source code paths will be set up by {{ant 
eclipse}} and {{ant idea}}).

One reason analyzers have not been put in {{lucene/analysis/common}} in the past 
is that they require dictionaries that are several megabytes in size.

Overall, I don't think the scheme we are using is all that problematic, but 
it's true that {{MorfologikAnalyzer}} and {{SmartChineseAnalyzer}} don't 
align with it.  The scheme doesn't easily lend itself to different 
implementations for one language, but that's not a common case today although 
it might become more common in the future.

In the case of Norwegian (no), there are ISO language codes for both Bokmål 
(nb) and Nynorsk (nn), and one way of supporting this is also to consider these 
as options to {{NorwegianAnalyzer}} since both languages are Norwegian.  See 
SOLR-4565 for thoughts on how to extend support in 
{{NorwegianMinimalStemFilter}} for this.

A similar overall approach might make sense when there are multiple 
implementations of a language; end-users can use an analyzer named 
{{Analyzer}} without requiring them to study the differences in 
implementation before using it.  I also see problems with this, but it's just a 
thought...

I'm all for improving our scheme, but perhaps we can open up a separate JIRA 
for this and keep this one focused on Korean?





> the korean analyzer that has a korean morphological analyzer and dictionaries
> -
>
> Key: LUCENE-4956
> URL: https://issues.apache.org/jira/browse/LUCENE-4956
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Affects Versions: 4.2
>Reporter: SooMyung Lee
>  Labels: newbie
> Attachments: kr.analyzer.4x.tar
>
>
> The Korean language has specific characteristics. When developing a search 
> service with Lucene & Solr in Korean, there are some problems in searching 
> and indexing. The Korean analyzer solves these problems with a Korean 
> morphological analyzer. It consists of a Korean morphological analyzer, 
> dictionaries, a Korean tokenizer and a Korean filter. The Korean analyzer is 
> made for Lucene and Solr. If you develop a search service with Lucene in 
> Korean, the Korean analyzer is the best choice.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries

2013-04-24 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13641365#comment-13641365
 ] 

Christian Moen commented on LUCENE-4956:


Thanks again, SooMyung!

I'm seeing that Steven has informed you about the grant process on the mailing 
list.  I'm happy to also facilitate this process with Steven.

Looking forward to getting Korean supported.


> the korean analyzer that has a korean morphological analyzer and dictionaries
> -
>
> Key: LUCENE-4956
> URL: https://issues.apache.org/jira/browse/LUCENE-4956
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Affects Versions: 4.2
>Reporter: SooMyung Lee
>  Labels: newbie
> Attachments: kr.analyzer.4x.tar
>
>
> The Korean language has specific characteristics. When developing a search 
> service with Lucene & Solr in Korean, there are some problems in searching 
> and indexing. The Korean analyzer solves these problems with a Korean 
> morphological analyzer. It consists of a Korean morphological analyzer, 
> dictionaries, a Korean tokenizer and a Korean filter. The Korean analyzer is 
> made for Lucene and Solr. If you develop a search service with Lucene in 
> Korean, the Korean analyzer is the best choice.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4947) Java implementation (and improvement) of Levenshtein & associated lexicon automata

2013-04-24 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13640551#comment-13640551
 ] 

Christian Moen commented on LUCENE-4947:


Kevin,

I think it's best that you do the license change yourself and that we don't 
have any active role in making the change, since you are the only person 
entitled to make it.

This change can be done by using the below header on all the source code and 
other relevant text files:

{noformat}
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
{noformat}

After this has been done, please make a tarball, attach it to this JIRA, and 
indicate that this is the code you wish to grant; also inform us of the 
MD5 hash of the tarball.  (This will go into the IP-clearance document and will 
be used to identify the codebase.)

It's a good idea to also use this MD5 hash as part of Exhibit A in the 
[software-grant.txt|http://www.apache.org/licenses/software-grant.txt] 
agreement unless you have signed and submitted this already.  (If you donate 
the code yourself by attaching it to the JIRA as described above, I believe the 
hashes not being part of Exhibit A is acceptable.)

Please feel free to add your comments, Steve.


> Java implementation (and improvement) of Levenshtein & associated lexicon 
> automata
> --
>
> Key: LUCENE-4947
> URL: https://issues.apache.org/jira/browse/LUCENE-4947
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 4.0-ALPHA, 4.0-BETA, 4.0, 4.1, 4.2, 4.2.1
>Reporter: Kevin Lawson
>
> I was encouraged by Mike McCandless to open an issue concerning this after I 
> contacted him privately about it. Thanks Mike!
> I'd like to submit my Java implementation of the Levenshtein Automaton as a 
> homogenous replacement for the current heterogenous, multi-component 
> implementation in Lucene.
> Benefits of upgrading include 
> - Reduced code complexity
> - Better performance from components that were previously implemented in 
> Python
> - Support for on-the-fly dictionary-automaton manipulation (if you wish to 
> use my dictionary-automaton implementation)
> The code for all the components is well structured, easy to follow, and 
> extensively commented. It has also been fully tested for correct 
> functionality and performance.
> The levenshtein automaton implementation (along with the required MDAG 
> reference) can be found in my LevenshteinAutomaton Java library here: 
> https://github.com/klawson88/LevenshteinAutomaton.
> The minimalistic directed acyclic graph (MDAG) which the automaton code uses 
> to store and step through word sets can be found here: 
> https://github.com/klawson88/MDAG
> *Transpositions aren't currently implemented. I hope the comment filled, 
> editing-friendly code combined with the fact that the section in the Mihov 
> paper detailing transpositions is only 2 pages makes adding the functionality 
> trivial.
> *As a result of support for on-the-fly manipulation, the MDAG 
> (dictionary-automaton) creation process incurs a slight speed penalty. In 
> order to have the best of both worlds, i'd recommend the addition of a 
> constructor which only takes sorted input. The complete, easy to follow 
> pseudo-code for the simple procedure can be found in the first article I 
> linked under the references section in the MDAG repository)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4947) Java implementation (and improvement) of Levenshtein & associated lexicon automata

2013-04-22 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13638128#comment-13638128
 ] 

Christian Moen commented on LUCENE-4947:


It sounds proper to do a code grant also because the software currently has a 
GPL license.  Thanks for following up, Steve.

> Java implementation (and improvement) of Levenshtein & associated lexicon 
> automata
> --
>
> Key: LUCENE-4947
> URL: https://issues.apache.org/jira/browse/LUCENE-4947
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 4.0-ALPHA, 4.0-BETA, 4.0, 4.1, 4.2, 4.2.1
>Reporter: Kevin Lawson
>
> I was encouraged by Mike McCandless to open an issue concerning this after I 
> contacted him privately about it. Thanks Mike!
> I'd like to submit my Java implementation of the Levenshtein Automaton as a 
> homogenous replacement for the current heterogenous, multi-component 
> implementation in Lucene.
> Benefits of upgrading include 
> - Reduced code complexity
> - Better performance from components that were previously implemented in 
> Python
> - Support for on-the-fly dictionary-automaton manipulation (if you wish to 
> use my dictionary-automaton implementation)
> The code for all the components is well structured, easy to follow, and 
> extensively commented. It has also been fully tested for correct 
> functionality and performance.
> The levenshtein automaton implementation (along with the required MDAG 
> reference) can be found in my LevenshteinAutomaton Java library here: 
> https://github.com/klawson88/LevenshteinAutomaton.
> The minimalistic directed acyclic graph (MDAG) which the automaton code uses 
> to store and step through word sets can be found here: 
> https://github.com/klawson88/MDAG
> *Transpositions aren't currently implemented. I hope the comment filled, 
> editing-friendly code combined with the fact that the section in the Mihov 
> paper detailing transpositions is only 2 pages makes adding the functionality 
> trivial.
> *As a result of support for on-the-fly manipulation, the MDAG 
> (dictionary-automaton) creation process incurs a slight speed penalty. In 
> order to have the best of both worlds, i'd recommend the addition of a 
> constructor which only takes sorted input. The complete, easy to follow 
> pseudo-code for the simple procedure can be found in the first article I 
> linked under the references section in the MDAG repository)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4947) Java implementation (and improvement) of Levenshtein & associated lexicon automata

2013-04-22 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13637943#comment-13637943
 ] 

Christian Moen commented on LUCENE-4947:


Thanks a lot for wishing to submit code!

It's not possible to include your code in Lucene if it has a GPL license.  
Quite frankly, I don't think Lucene committers can even have a look at it to 
consider it for inclusion while it has a GPL license.

If you have written all the code or otherwise own all copyrights, would you 
mind switching to Apache License 2.0?  That way, I at least think it would be 
possible to have a close look to see if this is a good fit for Lucene.

> Java implementation (and improvement) of Levenshtein & associated lexicon 
> automata
> --
>
> Key: LUCENE-4947
> URL: https://issues.apache.org/jira/browse/LUCENE-4947
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 4.0-ALPHA, 4.0-BETA, 4.0, 4.1, 4.2, 4.2.1
>Reporter: Kevin Lawson
>
> I was encouraged by Mike McCandless to open an issue concerning this after I 
> contacted him privately about it. Thanks Mike!
> I'd like to submit my Java implementation of the Levenshtein Automaton as a 
> homogenous replacement for the current heterogenous, multi-component 
> implementation in Lucene.
> Benefits of upgrading include 
> - Reduced code complexity
> - Better performance from components that were previously implemented in 
> Python
> - Support for on-the-fly dictionary-automaton manipulation (if you wish to 
> use my dictionary-automaton implementation)
> The code for all the components is well structured, easy to follow, and 
> extensively commented. It has also been fully tested for correct 
> functionality and performance.
> The levenshtein automaton implementation (along with the required MDAG 
> reference) can be found in my LevenshteinAutomaton Java library here: 
> https://github.com/klawson88/LevenshteinAutomaton.
> The minimalistic directed acyclic graph (MDAG) which the automaton code uses 
> to store and step through word sets can be found here: 
> https://github.com/klawson88/MDAG
> *Transpositions aren't currently implemented. I hope the comment filled, 
> editing-friendly code combined with the fact that the section in the Mihov 
> paper detailing transpositions is only 2 pages makes adding the functionality 
> trivial.
> *As a result of support for on-the-fly manipulation, the MDAG 
> (dictionary-automaton) creation process incurs a slight speed penalty. In 
> order to have the best of both worlds, i'd recommend the addition of a 
> constructor which only takes sorted input. The complete, easy to follow 
> pseudo-code for the simple procedure can be found in the first article I 
> linked under the references section in the MDAG repository)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-3706) Ship setup to log with log4j.

2013-03-15 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13603516#comment-13603516
 ] 

Christian Moen commented on SOLR-3706:
--

{quote}
Mark, have you tried Logback? That's a good logging implementation; arguably a 
better one.
{quote}

David and Mark, I believe [Log4J 2|http://logging.apache.org/log4j/2.x/] 
addresses a lot of the weaknesses in Log4J 1.x also addressed by Logback.  
However, Log4J 2 hasn't been released yet.

To me it sounds like a good idea to use Log4J 1.x now and move to Log4J 2 in 
the future.

> Ship setup to log with log4j.
> -
>
> Key: SOLR-3706
> URL: https://issues.apache.org/jira/browse/SOLR-3706
> Project: Solr
>  Issue Type: Improvement
>Reporter: Mark Miller
>Assignee: Mark Miller
>Priority: Minor
> Fix For: 4.3, 5.0
>
> Attachments: SOLR-3706-solr-log4j.patch
>
>
> Currently we default to java util logging and it's terrible in my opinion.
> *Its simple built-in logger is a 2-line logger.
> *You have to jump through hoops to use your own custom formatter with jetty - 
> either putting your class in the start.jar or other pain in the butt 
> solutions.
> *It can't roll files by date out of the box.
> I'm sure there are more issues, but those are the ones annoying me now. We 
> should switch to log4j - it's much nicer and it's easy to get a nice single 
> line format and roll by date, etc.
> If someone wants to use JUL they still can - but at least users could start 
> with something decent.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4407) SSL auth or basic auth in SolrCloud

2013-02-06 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13572524#comment-13572524
 ] 

Christian Moen commented on SOLR-4407:
--

Thanks a lot for clarifying, Jan.  I wasn't aware of this limitation.

> SSL auth or basic auth in SolrCloud
> ---
>
> Key: SOLR-4407
> URL: https://issues.apache.org/jira/browse/SOLR-4407
> Project: Solr
>  Issue Type: New Feature
>  Components: SolrCloud
>Affects Versions: 4.1
>Reporter: Sindre Fiskaa
>  Labels: Authentication, Certificate, SSL
> Fix For: 4.2, 5.0
>
>
> I need to be able to secure sensitive information in solrnodes running in a 
> SolrCloud with either SSL client/server certificates or http basic auth..

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4407) SSL auth or basic auth in SolrCloud

2013-02-06 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13572507#comment-13572507
 ] 

Christian Moen commented on SOLR-4407:
--

I don't think this is a Solr issue, but it might be helpful to provide general 
information on how to secure Solr's interfaces.  However, how to set this up is 
Servlet container specific.  Could you clarify what you had in mind for this?  
Thanks.

> SSL auth or basic auth in SolrCloud
> ---
>
> Key: SOLR-4407
> URL: https://issues.apache.org/jira/browse/SOLR-4407
> Project: Solr
>  Issue Type: New Feature
>  Components: SolrCloud
>Affects Versions: 4.1
>Reporter: Sindre Fiskaa
>  Labels: Authentication, Certificate, SSL
>
> I need to be able to secure sensitive information in solrnodes running in a 
> SolrCloud with either SSL client/server certificates or http basic auth..

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji

2012-10-11 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13474287#comment-13474287
 ] 

Christian Moen commented on LUCENE-3922:


Ohtani-san,

I saw your tweet about this earlier and it sounds like a very good idea.  
Thanks.

I will try to set aside some time to work on this.

> Add Japanese Kanji number normalization to Kuromoji
> ---
>
> Key: LUCENE-3922
> URL: https://issues.apache.org/jira/browse/LUCENE-3922
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Affects Versions: 4.0-ALPHA
>Reporter: Kazuaki Hiraga
>  Labels: features
> Attachments: LUCENE-3922.patch
>
>
> Japanese people use Kanji numerals instead of Arabic numerals for writing 
> prices, addresses and so on, e.g. 12万4800円 (124,800 JPY), 二番町三ノ二 (3-2 Nibancho) and 
> 十二月 (December).  So, we would like to normalize those Kanji numerals to Arabic 
> numerals (I don't think we need to have a capability to normalize to Kanji 
> numerals).
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji

2012-10-11 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13474224#comment-13474224
 ] 

Christian Moen commented on LUCENE-3922:


Thanks, Kazu.

I'm aware of the issue and the thinking is to rework this as a {{TokenFilter}} 
and use anchoring options with surrounding tokens to decide if normalisation 
should take place, i.e. if the preceding token is ¥ or the following token is 円 
in the case of normalising prices.

It might also be helpful to look into using POS-info for this to benefit from 
what we actually know about the token, i.e. to not apply normalisation if the 
POS tag is a person name.

Other suggestions and ideas are of course most welcome.
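
To make the POS idea concrete, here is a minimal, illustrative sketch of what 
such a filter could look like.  It assumes Kuromoji's 
{{PartOfSpeechAttribute}} and the IPADIC person-name tag prefix 
(名詞-固有名詞-人名); the class name and the {{normalizeNumber}} placeholder are 
hypothetical, and the sketch sidesteps the harder problem of joining 
multi-token numbers.

{noformat}
import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ja.tokenattributes.PartOfSpeechAttribute;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Illustrative only: skips kanji-number normalization for tokens whose
// IPADIC part-of-speech marks them as person names.
public final class PosAwareNumberNormalizationFilter extends TokenFilter {

  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PartOfSpeechAttribute posAtt = addAttribute(PartOfSpeechAttribute.class);

  public PosAwareNumberNormalizationFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    String pos = posAtt.getPartOfSpeech();
    if (pos != null && pos.startsWith("名詞-固有名詞-人名")) {
      return true; // person name, e.g. 一二三 (Hifumi): leave it untouched
    }
    String normalized = normalizeNumber(termAtt.toString());
    if (normalized != null) {
      termAtt.setEmpty().append(normalized);
    }
    return true;
  }

  // Placeholder for the actual kanji-to-Arabic conversion; returns null when
  // the term is not a number.
  private String normalizeNumber(String term) {
    return null;
  }
}
{noformat}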


> Add Japanese Kanji number normalization to Kuromoji
> ---
>
> Key: LUCENE-3922
> URL: https://issues.apache.org/jira/browse/LUCENE-3922
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Affects Versions: 4.0-ALPHA
>Reporter: Kazuaki Hiraga
>  Labels: features
> Attachments: LUCENE-3922.patch
>
>
> Japanese people use Kanji numerals instead of Arabic numerals for writing 
> prices, addresses and so on, e.g. 12万4800円 (124,800 JPY), 二番町三ノ二 (3-2 Nibancho) and 
> 十二月 (December).  So, we would like to normalize those Kanji numerals to Arabic 
> numerals (I don't think we need to have a capability to normalize to Kanji 
> numerals).
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji

2012-10-06 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13471132#comment-13471132
 ] 

Christian Moen commented on LUCENE-3922:


{quote}
Is it difficult to support numbers with period as the following?
3.2兆円
5.2億円
{quote}

Supporting this is no problem and a good idea.

{quote}
I think It would be helpful that this charfilter supports old Kanji numeric 
characters ("KYU-KANJI" or "DAIJI") such as 壱, 壹 (One), 弌, 弐, 貳 (Two), 弍, 参,參 
(Three), or configureable.
{quote}

This is also easy to support.

As for making preserving zeros configurable, that's also possible, of course.

It's great to get more feedback on what sort of functionality we need and what 
should be configurable options. Hopefully, we can find a good balance without 
adding too much complexity.

Thanks for the feedback.
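
As a rough illustration of the KYU-KANJI/DAIJI support mentioned above, a 
simple character table could map the formal variants onto the standard digits 
before the main conversion runs.  This is only a sketch with made-up class and 
method names; the mappings are the ones listed in the quote (壱/壹/弌 → 一, 
弐/貳/弍 → 二, 参/參 → 三).

{noformat}
import java.util.HashMap;
import java.util.Map;

public class DaijiNormalizer {

  // Formal/old kanji numerals ("KYU-KANJI"/"DAIJI") mapped to standard forms.
  private static final Map<Character, Character> DAIJI =
      new HashMap<Character, Character>();
  static {
    DAIJI.put('壱', '一'); DAIJI.put('壹', '一'); DAIJI.put('弌', '一'); // one
    DAIJI.put('弐', '二'); DAIJI.put('貳', '二'); DAIJI.put('弍', '二'); // two
    DAIJI.put('参', '三'); DAIJI.put('參', '三');                       // three
  }

  // Replaces formal kanji digits with their standard equivalents so the main
  // kanji-to-Arabic conversion only has to handle one set of characters.
  public static String normalize(String text) {
    StringBuilder sb = new StringBuilder(text.length());
    for (int i = 0; i < text.length(); i++) {
      char c = text.charAt(i);
      Character std = DAIJI.get(c);
      sb.append(std != null ? std : c);
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    System.out.println(normalize("金壱万円")); // prints 金一万円
  }
}
{noformat}
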

> Add Japanese Kanji number normalization to Kuromoji
> ---
>
> Key: LUCENE-3922
> URL: https://issues.apache.org/jira/browse/LUCENE-3922
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Affects Versions: 4.0-ALPHA
>Reporter: Kazuaki Hiraga
>  Labels: features
> Attachments: LUCENE-3922.patch
>
>
> Japanese people use Kanji numerals instead of Arabic numerals for writing 
> prices, addresses and so on, e.g. 12万4800円 (124,800 JPY), 二番町三ノ二 (3-2 Nibancho) and 
> 十二月 (December).  So, we would like to normalize those Kanji numerals to Arabic 
> numerals (I don't think we need to have a capability to normalize to Kanji 
> numerals).
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3921) Add decompose compound Japanese Katakana token capability to Kuromoji

2012-10-06 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13470967#comment-13470967
 ] 

Christian Moen commented on LUCENE-3921:


Lance,

The idea I had in mind for Japanese uses language-specific characteristics for 
katakana terms and perhaps weights that are dictionary-specific as well.  
However, we are hacking our statistical model here and there are 
limitations as to how far we can go with this.

I don't know a whole lot about the Smart Chinese toolkit, but I believe the 
same approach to compound segmentation could work for Chinese as well.  
However, weights and implementation would likely be separate.  Note that the 
above is really about one specific kind of compound segmentation that applies 
to Japanese, so the thinking was to add additional heuristics for this specific 
type that is particularly tricky.

It might be a good idea to also approach this problem using the 
{{DictionaryCompoundWordTokenFilter}} and collectively build some lexical 
assets for compound splitting for the relevant languages, rather than hacking 
our models.
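
Purely as a sketch of that direction, something along the following lines 
could be tried with a hand-built sub-word list.  The constructor signatures 
assume a Lucene 4.x-era API (they differ between versions), and the tiny word 
list is for illustration only; a real lexical asset would be much larger and 
curated per language.

{noformat}
import java.io.Reader;
import java.io.StringReader;
import java.util.Arrays;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.compound.DictionaryCompoundWordTokenFilter;
import org.apache.lucene.analysis.ja.JapaneseTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.util.Version;

public class KatakanaCompoundSplitSketch {
  public static void main(String[] args) throws Exception {
    // Tiny, hand-built list of katakana sub-words used to split compounds.
    CharArraySet subwords = new CharArraySet(Version.LUCENE_40,
        Arrays.asList("トート", "ショルダー", "バッグ"), false);

    Reader reader = new StringReader("トートバッグとショルダーバッグ");
    TokenStream ts =
        new JapaneseTokenizer(reader, null, true, JapaneseTokenizer.Mode.SEARCH);
    // Emits the original compound plus any dictionary sub-words found in it.
    ts = new DictionaryCompoundWordTokenFilter(Version.LUCENE_40, ts, subwords);

    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      System.out.println(term.toString());
    }
    ts.end();
    ts.close();
  }
}
{noformat}
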

> Add decompose compound Japanese Katakana token capability to Kuromoji
> -
>
> Key: LUCENE-3921
> URL: https://issues.apache.org/jira/browse/LUCENE-3921
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Affects Versions: 4.0-ALPHA
> Environment: Cent OS 5, IPA Dictionary, Run with "Search mdoe"
>Reporter: Kazuaki Hiraga
>  Labels: features
>
> The Japanese morphological analyzer Kuromoji doesn't have the capability to 
> decompose every Japanese Katakana compound token into sub-tokens. It seems 
> that some Katakana tokens can be decomposed, but this cannot be applied to 
> every Katakana compound token. For instance, "トートバッグ(tote bag)" and "ショルダーバッグ" 
> don't decompose into "トート バッグ" and "ショルダー バッグ" although the IPA dictionary 
> has "バッグ" in its entry.  I would like to apply the decompose feature to every 
> Katakana token if the sub-tokens are in the dictionary, or add the capability 
> to force-apply the decompose feature to every Katakana token.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4433) kuromoji ToStringUtil.getRomanization

2012-09-26 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13463637#comment-13463637
 ] 

Christian Moen commented on LUCENE-4433:


Any thoughts on whether we should backport this - or just a fix for the specific 
case mentioned - to the 3.6 branch, Robert?

I'm happy to do it, but I'm not sure if there will be a 3.6.2 with 4.0 being so 
close.


>  kuromoji  ToStringUtil.getRomanization
> ---
>
> Key: LUCENE-4433
> URL: https://issues.apache.org/jira/browse/LUCENE-4433
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Affects Versions: 3.6.1
>Reporter: Wang Han
>
> case 'メ':
>   builder.append("mi");
>   break;
> -
> should be 
> case 'メ':
>   builder.append("me");
>   break;
> you can refer http://en.wikipedia.org/wiki/Katakana 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-4433) kuromoji ToStringUtil.getRomanization

2012-09-26 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13463611#comment-13463611
 ] 

Christian Moen edited comment on LUCENE-4433 at 9/26/12 7:19 PM:
-

Robert has already fixed this on {{trunk}} in {{r1339753}}.

  was (Author: cm):
Robert has already fixed this on {{trunk}} in {{r1339753}.
  
>  kuromoji  ToStringUtil.getRomanization
> ---
>
> Key: LUCENE-4433
> URL: https://issues.apache.org/jira/browse/LUCENE-4433
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Affects Versions: 3.6.1
>Reporter: Wang Han
>
> case 'メ':
>   builder.append("mi");
>   break;
> -
> should be 
> case 'メ':
>   builder.append("me");
>   break;
> you can refer http://en.wikipedia.org/wiki/Katakana 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-4433) kuromoji ToStringUtil.getRomanization

2012-09-26 Thread Christian Moen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian Moen updated LUCENE-4433:
---

  Component/s: modules/analysis
Affects Version/s: 3.6.1

>  kuromoji  ToStringUtil.getRomanization
> ---
>
> Key: LUCENE-4433
> URL: https://issues.apache.org/jira/browse/LUCENE-4433
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Affects Versions: 3.6.1
>Reporter: Wang Han
>
> case 'メ':
>   builder.append("mi");
>   break;
> -
> should be 
> case 'メ':
>   builder.append("me");
>   break;
> you can refer http://en.wikipedia.org/wiki/Katakana 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4433) kuromoji ToStringUtil.getRomanization

2012-09-26 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13463611#comment-13463611
 ] 

Christian Moen commented on LUCENE-4433:


Robert has already fixed this on {{trunk}} in {{r1339753}.

>  kuromoji  ToStringUtil.getRomanization
> ---
>
> Key: LUCENE-4433
> URL: https://issues.apache.org/jira/browse/LUCENE-4433
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Wang Han
>
> case 'メ':
>   builder.append("mi");
>   break;
> -
> should be 
> case 'メ':
>   builder.append("me");
>   break;
> you can refer http://en.wikipedia.org/wiki/Katakana 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4433) kuromoji ToStringUtil.getRomanization

2012-09-26 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13463595#comment-13463595
 ] 

Christian Moen commented on LUCENE-4433:


Thanks a lot for this.  I'll fix.

>  kuromoji  ToStringUtil.getRomanization
> ---
>
> Key: LUCENE-4433
> URL: https://issues.apache.org/jira/browse/LUCENE-4433
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Wang Han
>
> case 'メ':
>   builder.append("mi");
>   break;
> -
> should be 
> case 'メ':
>   builder.append("me");
>   break;
> you can refer http://en.wikipedia.org/wiki/Katakana 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-3876) Solr Admin UI is completely dysfunctional on IE 9

2012-09-24 Thread Christian Moen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-3876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian Moen updated SOLR-3876:
-

Fix Version/s: (was: 4.0)
   4.1

> Solr Admin UI is completely dysfunctional on IE 9
> -
>
> Key: SOLR-3876
> URL: https://issues.apache.org/jira/browse/SOLR-3876
> Project: Solr
>  Issue Type: Bug
>  Components: web gui
>Affects Versions: 4.0-BETA, 4.0
> Environment: Windows 7, IE 9
>Reporter: Jack Krupansky
>Priority: Critical
> Fix For: 4.1
>
> Attachments: screenshot-1.jpg, screenshot-2.jpg, screenshot-3.jpg
>
>
> The Solr Admin UI is completely dysfunctional on IE 9. See attached screen 
> shot. I don't even see a "collection1" button. But Admin UI is working fine 
> in Google Chrome with same running instance of Solr.
> Currently running 4.0 RC0, but problem existed with 4.0-BETA.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-3876) Solr Admin UI is completely dysfunctional on IE 9

2012-09-24 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13461889#comment-13461889
 ] 

Christian Moen commented on SOLR-3876:
--

The 4.0 UI wasn't developed with IE9 in mind, so getting IE9 supported seems 
like a bigger effort.  SOLR-3841 seems related to this issue and has been 
deferred to 4.1, so I'm suggesting that we do the same with this one as well.

Please feel free to jump in with whatever comments you might have, steffkes.

> Solr Admin UI is completely dysfunctional on IE 9
> -
>
> Key: SOLR-3876
> URL: https://issues.apache.org/jira/browse/SOLR-3876
> Project: Solr
>  Issue Type: Bug
>  Components: web gui
>Affects Versions: 4.0-BETA, 4.0
> Environment: Windows 7, IE 9
>Reporter: Jack Krupansky
>Priority: Critical
> Fix For: 4.1
>
> Attachments: screenshot-1.jpg, screenshot-2.jpg, screenshot-3.jpg
>
>
> The Solr Admin UI is completely dysfunctional on IE 9. See attached screen 
> shot. I don't even see a "collection1" button. But Admin UI is working fine 
> in Google Chrome with same running instance of Solr.
> Currently running 4.0 RC0, but problem existed with 4.0-BETA.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (SOLR-3876) Solr Admin UI is completely dysfunctional on IE 9

2012-09-24 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13461860#comment-13461860
 ] 

Christian Moen edited comment on SOLR-3876 at 9/25/12 2:54 AM:
---

Thanks a lot for this, Jack.

I'm afraid I don't know the overall status or history of the 4.0 UI in IE9, 
but do you happen to know if this is a regression or if the UI has been 
generally broken for IE9 all along?

To me it sounds quite important to get this fixed for 4.0 if it's a regression. 
 I can help working some on this.

  was (Author: cm):
Thanks a lot for this, Jack.

I'm afraid I don't know the overall status or history of the 4.0 UI in IE9, 
but do you happen to know if this is a regression or if the UI has been 
generally broken for IE9 all along?

To me it sounds quite important to get this fixed for 4.0 and I can help 
working some on this.
  
> Solr Admin UI is completely dysfunctional on IE 9
> -
>
> Key: SOLR-3876
> URL: https://issues.apache.org/jira/browse/SOLR-3876
> Project: Solr
>  Issue Type: Bug
>  Components: web gui
>Affects Versions: 4.0-BETA, 4.0
> Environment: Windows 7, IE 9
>Reporter: Jack Krupansky
>Priority: Critical
> Fix For: 4.0
>
> Attachments: screenshot-1.jpg, screenshot-2.jpg, screenshot-3.jpg
>
>
> The Solr Admin UI is completely dysfunctional on IE 9. See attached screen 
> shot. I don't even see a "collection1" button. But Admin UI is working fine 
> in Google Chrome with same running instance of Solr.
> Currently running 4.0 RC0, but problem existed with 4.0-BETA.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-3876) Solr Admin UI is completely dysfunctional on IE 9

2012-09-24 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13461860#comment-13461860
 ] 

Christian Moen commented on SOLR-3876:
--

Thanks a lot for this, Jack.

I'm afraid I don't know the overall status or history of the 4.0 UI in IE9, 
but do you happen to know if this is a regression or if the UI has been 
generally broken for IE9 all along?

To me it sounds quite important to get this fixed for 4.0 and I can help 
working some on this.

> Solr Admin UI is completely dysfunctional on IE 9
> -
>
> Key: SOLR-3876
> URL: https://issues.apache.org/jira/browse/SOLR-3876
> Project: Solr
>  Issue Type: Bug
>  Components: web gui
>Affects Versions: 4.0-BETA, 4.0
> Environment: Windows 7, IE 9
>Reporter: Jack Krupansky
>Priority: Critical
> Fix For: 4.0
>
> Attachments: screenshot-1.jpg, screenshot-2.jpg, screenshot-3.jpg
>
>
> The Solr Admin UI is completely dysfunctional on IE 9. See attached screen 
> shot. I don't even see a "collection1" button. But Admin UI is working fine 
> in Google Chrome with same running instance of Solr.
> Currently running 4.0 RC0, but problem existed with 4.0-BETA.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-4330) Add NAIST-jdic support to Kuromoji

2012-08-27 Thread Christian Moen (JIRA)
Christian Moen created LUCENE-4330:
--

 Summary: Add NAIST-jdic support to Kuromoji
 Key: LUCENE-4330
 URL: https://issues.apache.org/jira/browse/LUCENE-4330
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/analysis
Affects Versions: 5.0, 4.0
Reporter: Christian Moen


We should look into adding NAIST-jdic support to Kuromoji as this dictionary is 
better than the current IPADIC.  The NAIST-jdic license seems fine, but needs a 
formal check-off before any inclusion in Lucene.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji

2012-07-30 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13425488#comment-13425488
 ] 

Christian Moen commented on LUCENE-3922:


I've attached a work-in-progress patch for {{trunk}} that implements a 
{{CharFilter}} that normalizes Japanese numbers.

These are some TODOs and implementation considerations I have that I'd be 
thankful to get feedback on:

* Buffering the entire input on the first read should be avoided.  The primary 
reason this is done is that I was thinking of adding some regexps before and 
after kanji numeric strings to qualify their normalization, i.e. to only 
normalize strings that start with ¥ or JPY or end with 円, so that only 
monetary amounts in Japanese yen are normalized.  However, this probably isn't 
necessary as we can probably use {{Matcher.requireEnd()}} and 
{{Matcher.hitEnd()}} to decide if we need to read more input (see the sketch 
at the end of this comment).  (Thanks, Robert!)

* Is qualifying the numbers to be normalized with prefix and suffix regexps 
useful, i.e. to only normalize monetary amounts?

* How do we deal with leading zeros?  Currently, "007" and "◯◯七" both become 
"7".  Do we want an option to preserve leading zeros?

* How large numbers do we care about supporting?  Some of the larger numbers 
are surrogates, which complicates implementation, but they're certainly 
possible.  If we don't care about really large numbers, we can probably be fine 
working with {{long}} instead of {{BigInteger}}.

* Polite numbers and some other variants aren't supported, i.e. 壱, 弐, 参, etc., 
but they can easily be added.  We can also add the obsolete variants if that's 
useful somehow.  Are these useful?  Do we want them available via an option?

* Number formats such as "1億2,345万6,789" aren't supported - we don't deal with 
the comma today, but this can be added.  The same applies to "12 345", where a 
space separates thousands as in French.  Numbers like "2・2兆" aren't supported 
either, but can be added.

* Only integers are supported today, so we can't parse "〇・一二三四", which becomes 
"0" and "1234" as separate tokens instead of "0.1234".

There are probably other considerations, too, that don't immediately come to 
mind.

Numbers are fairly complicated and feedback on direction for further 
implementation is most appreciated.  Thanks.
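
For what it's worth, below is a tiny sketch of the 
{{Matcher.hitEnd()}}/{{Matcher.requireEnd()}} idea from the first bullet.  The 
class name and the simplified kanji-digit pattern are only for illustration 
and are not from the attached patch.

{noformat}
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class KanjiNumberBuffering {

  // Simplified, illustrative pattern for a run of kanji numerals.
  private static final Pattern KANJI_NUMBER =
      Pattern.compile("[〇一二三四五六七八九十百千万億兆]+");

  // Decides whether the tail of the buffered text might still be part of a
  // number, in which case the CharFilter should read more before emitting.
  static boolean needMoreInput(CharSequence buffered) {
    Matcher m = KANJI_NUMBER.matcher(buffered);
    if (m.find()) {
      // hitEnd(): the match ran into the end of the buffer and could grow.
      // requireEnd(): more input could turn this match into a non-match.
      return m.hitEnd() || m.requireEnd();
    }
    // No match yet, but the matcher may have hit the end of the buffer while
    // trying; reading more input could still complete a match.
    return m.hitEnd();
  }

  public static void main(String[] args) {
    System.out.println(needMoreInput("価格は一万"));     // true: more digits may follow
    System.out.println(needMoreInput("価格は一万円です")); // false: the number is complete
  }
}
{noformat}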

> Add Japanese Kanji number normalization to Kuromoji
> ---
>
> Key: LUCENE-3922
> URL: https://issues.apache.org/jira/browse/LUCENE-3922
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Affects Versions: 4.0-ALPHA
>Reporter: Kazuaki Hiraga
>  Labels: features
> Attachments: LUCENE-3922.patch
>
>
> Japanese people use Kanji numerals instead of Arabic numerals for writing 
> prices, addresses and so on, e.g. 12万4800円 (124,800 JPY), 二番町三ノ二 (3-2 Nibancho) and 
> 十二月 (December).  So, we would like to normalize those Kanji numerals to Arabic 
> numerals (I don't think we need to have a capability to normalize to Kanji 
> numerals).
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji

2012-07-30 Thread Christian Moen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian Moen updated LUCENE-3922:
---

Attachment: LUCENE-3922.patch

> Add Japanese Kanji number normalization to Kuromoji
> ---
>
> Key: LUCENE-3922
> URL: https://issues.apache.org/jira/browse/LUCENE-3922
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Affects Versions: 4.0-ALPHA
>Reporter: Kazuaki Hiraga
>  Labels: features
> Attachments: LUCENE-3922.patch
>
>
> Japanese people use Kanji numerals instead of Arabic numerals for writing 
> prices, addresses and so on, e.g. 12万4800円 (124,800 JPY), 二番町三ノ二 (3-2 Nibancho) and 
> 十二月 (December).  So, we would like to normalize those Kanji numerals to Arabic 
> numerals (I don't think we need to have a capability to normalize to Kanji 
> numerals).
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-3524) Make discard-punctuation feature in Kuromoji configurable from JapaneseTokenizerFactory

2012-07-12 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13412685#comment-13412685
 ] 

Christian Moen commented on SOLR-3524:
--

{{CHANGES.txt}} for some reason didn't make it into {{branch_4x}}.  Fixed this 
in revision 1360622.

> Make discard-punctuation feature in Kuromoji configurable from 
> JapaneseTokenizerFactory
> ---
>
> Key: SOLR-3524
> URL: https://issues.apache.org/jira/browse/SOLR-3524
> Project: Solr
>  Issue Type: Improvement
>  Components: Schema and Analysis
>Affects Versions: 3.6
>Reporter: Kazuaki Hiraga
>Assignee: Christian Moen
>Priority: Minor
> Fix For: 4.0, 5.0
>
> Attachments: SOLR-3524.patch, SOLR-3524.patch, 
> kuromoji_discard_punctuation.patch.txt
>
>
> Kuromoji's JapaneseTokenizer doesn't provide a configuration option to preserve 
> punctuation in Japanese text, although it has a parameter to change this 
> behavior.  JapaneseTokenizerFactory always sets the third parameter, which 
> controls this behavior, to true to remove punctuation.
> I would like an option so that I can configure this behavior in the fieldType 
> definition in schema.xml.
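
For reference, here is a minimal, hypothetical sketch of what a configurable 
flag could look like.  The {{discardPunctuation}} attribute name and the factory 
shape are assumptions made for illustration and not necessarily what the 
attached patches do; only the {{JapaneseTokenizer}} constructor, whose boolean 
third argument is the flag mentioned above, is from the existing API:

{code:java}
// Illustrative sketch only; attribute and class names are assumptions.
import java.io.Reader;
import java.util.Map;

import org.apache.lucene.analysis.ja.JapaneseTokenizer;
import org.apache.lucene.analysis.ja.JapaneseTokenizer.Mode;
import org.apache.lucene.analysis.ja.dict.UserDictionary;

class ConfigurableJapaneseTokenizerFactorySketch {
  private final UserDictionary userDictionary;
  private final Mode mode;
  private final boolean discardPunctuation;

  ConfigurableJapaneseTokenizerFactorySketch(Map<String, String> args,
                                             UserDictionary userDictionary,
                                             Mode mode) {
    this.userDictionary = userDictionary;
    this.mode = mode;
    // Default to true so existing field types keep today's behavior.
    String value = args.get("discardPunctuation");
    this.discardPunctuation = value == null || Boolean.parseBoolean(value);
  }

  JapaneseTokenizer create(Reader input) {
    // The third constructor argument is the hard-coded flag the issue refers
    // to; here it comes from the field type configuration instead.
    return new JapaneseTokenizer(input, userDictionary, discardPunctuation, mode);
  }
}
{code}

A field type would then be able to pass the attribute (or leave it out to keep 
today's behavior) alongside the other tokenizer attributes in schema.xml.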

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Resolved] (SOLR-3524) Make discard-punctuation feature in Kuromoji configurable from JapaneseTokenizerFactory

2012-07-12 Thread Christian Moen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian Moen resolved SOLR-3524.
--

   Resolution: Fixed
Fix Version/s: 5.0
   4.0

Thanks, Kazu and Ohtani-san!

> Make discard-punctuation feature in Kuromoji configurable from 
> JapaneseTokenizerFactory
> ---
>
> Key: SOLR-3524
> URL: https://issues.apache.org/jira/browse/SOLR-3524
> Project: Solr
>  Issue Type: Improvement
>  Components: Schema and Analysis
>Affects Versions: 3.6
>Reporter: Kazuaki Hiraga
>Assignee: Christian Moen
>Priority: Minor
> Fix For: 4.0, 5.0
>
> Attachments: SOLR-3524.patch, SOLR-3524.patch, 
> kuromoji_discard_punctuation.patch.txt
>
>
> Kuromoji's JapaneseTokenizer doesn't provide a configuration option to preserve 
> punctuation in Japanese text, although it has a parameter to change this 
> behavior.  JapaneseTokenizerFactory always sets the third parameter, which 
> controls this behavior, to true to remove punctuation.
> I would like an option so that I can configure this behavior in the fieldType 
> definition in schema.xml.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-3524) Make discard-punctuation feature in Kuromoji configurable from JapaneseTokenizerFactory

2012-07-12 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13412659#comment-13412659
 ] 

Christian Moen commented on SOLR-3524:
--

Committed revision 1360613 on {{branch_4x}}

> Make discard-punctuation feature in Kuromoji configurable from 
> JapaneseTokenizerFactory
> ---
>
> Key: SOLR-3524
> URL: https://issues.apache.org/jira/browse/SOLR-3524
> Project: Solr
>  Issue Type: Improvement
>  Components: Schema and Analysis
>Affects Versions: 3.6
>Reporter: Kazuaki Hiraga
>Assignee: Christian Moen
>Priority: Minor
> Fix For: 4.0, 5.0
>
> Attachments: SOLR-3524.patch, SOLR-3524.patch, 
> kuromoji_discard_punctuation.patch.txt
>
>
> Kuromoji's JapaneseTokenizer doesn't provide a configuration option to preserve 
> punctuation in Japanese text, although it has a parameter to change this 
> behavior.  JapaneseTokenizerFactory always sets the third parameter, which 
> controls this behavior, to true to remove punctuation.
> I would like an option so that I can configure this behavior in the fieldType 
> definition in schema.xml.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-3524) Make discard-punctuation feature in Kuromoji configurable from JapaneseTokenizerFactory

2012-07-12 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13412628#comment-13412628
 ] 

Christian Moen commented on SOLR-3524:
--

Committed revision 1360592 on {{trunk}}

> Make discard-punctuation feature in Kuromoji configurable from 
> JapaneseTokenizerFactory
> ---
>
> Key: SOLR-3524
> URL: https://issues.apache.org/jira/browse/SOLR-3524
> Project: Solr
>  Issue Type: Improvement
>  Components: Schema and Analysis
>Affects Versions: 3.6
>Reporter: Kazuaki Hiraga
>Assignee: Christian Moen
>Priority: Minor
> Attachments: SOLR-3524.patch, SOLR-3524.patch, 
> kuromoji_discard_punctuation.patch.txt
>
>
> Kuromoji's JapaneseTokenizer doesn't provide a configuration option to preserve 
> punctuation in Japanese text, although it has a parameter to change this 
> behavior.  JapaneseTokenizerFactory always sets the third parameter, which 
> controls this behavior, to true to remove punctuation.
> I would like an option so that I can configure this behavior in the fieldType 
> definition in schema.xml.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-3524) Make discard-punctuation feature in Kuromoji configurable from JapaneseTokenizerFactory

2012-07-12 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13412627#comment-13412627
 ] 

Christian Moen commented on SOLR-3524:
--

Patch updated due to recent configuration changes.

> Make discard-punctuation feature in Kuromoji configurable from 
> JapaneseTokenizerFactory
> ---
>
> Key: SOLR-3524
> URL: https://issues.apache.org/jira/browse/SOLR-3524
> Project: Solr
>  Issue Type: Improvement
>  Components: Schema and Analysis
>Affects Versions: 3.6
>Reporter: Kazuaki Hiraga
>Assignee: Christian Moen
>Priority: Minor
> Attachments: SOLR-3524.patch, SOLR-3524.patch, 
> kuromoji_discard_punctuation.patch.txt
>
>
> Kuromoji's JapaneseTokenizer doesn't provide a configuration option to preserve 
> punctuation in Japanese text, although it has a parameter to change this 
> behavior.  JapaneseTokenizerFactory always sets the third parameter, which 
> controls this behavior, to true to remove punctuation.
> I would like an option so that I can configure this behavior in the fieldType 
> definition in schema.xml.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-3524) Make discard-punctuation feature in Kuromoji configurable from JapaneseTokenizerFactory

2012-07-12 Thread Christian Moen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian Moen updated SOLR-3524:
-

Attachment: SOLR-3524.patch

> Make discard-punctuation feature in Kuromoji configurable from 
> JapaneseTokenizerFactory
> ---
>
> Key: SOLR-3524
> URL: https://issues.apache.org/jira/browse/SOLR-3524
> Project: Solr
>  Issue Type: Improvement
>  Components: Schema and Analysis
>Affects Versions: 3.6
>Reporter: Kazuaki Hiraga
>Assignee: Christian Moen
>Priority: Minor
> Attachments: SOLR-3524.patch, SOLR-3524.patch, 
> kuromoji_discard_punctuation.patch.txt
>
>
> Kuromoji's JapaneseTokenizer doesn't provide a configuration option to preserve 
> punctuation in Japanese text, although it has a parameter to change this 
> behavior.  JapaneseTokenizerFactory always sets the third parameter, which 
> controls this behavior, to true to remove punctuation.
> I would like an option so that I can configure this behavior in the fieldType 
> definition in schema.xml.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4207) speed up our slowest tests

2012-07-11 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13411654#comment-13411654
 ] 

Christian Moen commented on LUCENE-4207:


Thanks a lot, Dawid.  I'll try this, have a look and report back.

Adrien, thanks for taking the time!

> speed up our slowest tests
> --
>
> Key: LUCENE-4207
> URL: https://issues.apache.org/jira/browse/LUCENE-4207
> Project: Lucene - Java
>  Issue Type: Bug
>Reporter: Robert Muir
>
> I was surprised to hear from Christian that lucene/solr tests take him 40 
> minutes on a modern mac.
> This is too much.  Let's look at the slowest tests and make them reasonable.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org


