[jira] [Commented] (LUCENE-8959) JapaneseNumberFilter does not take whitespaces into account when concatenating numbers
[ https://issues.apache.org/jira/browse/LUCENE-8959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16918464#comment-16918464 ] Christian Moen commented on LUCENE-8959: Sounds like a good idea. This is also a rather big rabbit hole... Would it be useful to consider making the digit grouping separators configurable as part of a bigger scheme here? In Japanese, if you're processing text with SI numbers, I believe a space is a valid digit grouping separator. > JapaneseNumberFilter does not take whitespaces into account when > concatenating numbers > -- > > Key: LUCENE-8959 > URL: https://issues.apache.org/jira/browse/LUCENE-8959 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Jim Ferenczi >Priority: Minor > > Today the JapaneseNumberFilter tries to concatenate numbers even if they are > separated by whitespace. So for instance "10 100" is rewritten into "10100" > even if the tokenizer doesn't discard punctuation. In practice this is not > an issue but this can lead to giant tokens if there are a lot of > numbers separated by spaces. The number of concatenations should be > configurable with a sane default limit in order to avoid creating big tokens > that slow down the analysis. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
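The idea of configurable grouping separators plus a concatenation limit could look roughly like the sketch below. This is a minimal illustration only; the class and method names are hypothetical and are not part of the actual JapaneseNumberFilter API.

{code:java}
import java.util.Set;

// Hypothetical configuration object for the ideas discussed above: which characters
// may join digit groups into one number, and how many adjacent number tokens may be
// concatenated before the filter stops composing them.
public class DigitGroupingConfig {

  private final Set<Character> groupingSeparators;
  private final int maxConcatenations;

  public DigitGroupingConfig(Set<Character> groupingSeparators, int maxConcatenations) {
    this.groupingSeparators = groupingSeparators;
    this.maxConcatenations = maxConcatenations;
  }

  /** True if the character may join two digit groups into a single number. */
  public boolean isGroupingSeparator(char c) {
    return groupingSeparators.contains(c);
  }

  /** True if another number token may still be appended to the current one. */
  public boolean canConcatenate(int concatenationsSoFar) {
    return concatenationsSoFar < maxConcatenations;
  }

  public static void main(String[] args) {
    // A space is deliberately not included, so "10 100" stays as two tokens;
    // SI-style grouping could be enabled by adding ' ' to the set.
    DigitGroupingConfig config = new DigitGroupingConfig(Set.of(',', '，'), 64);
    System.out.println(config.isGroupingSeparator(','));  // true
    System.out.println(config.isGroupingSeparator(' '));  // false
    System.out.println(config.canConcatenate(10));        // true
  }
}
{code}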
[jira] [Commented] (LUCENE-8817) Combine Nori and Kuromoji DictionaryBuilder
[ https://issues.apache.org/jira/browse/LUCENE-8817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16859673#comment-16859673 ] Christian Moen commented on LUCENE-8817: Thanks, [~tomoko]. I don't think we should use "mecab" in the naming. Please let me elaborate a bit. Kuromoji can read MeCab format models, but Kuromoji isn't a port of MeCab. Kuromoji has been developed independently without inspecting or reviewing any MeCab source code. This was an initial goal of the project, to make sure we could use the Apache License. The MeCab and Kuromoji feature sets are quite different and I think users will find it confusing if they expect MeCab and find that Kuromoji is much more limited. I'm also unsure if Kudo-san will appreciate that we make an association by name like this. It certainly doesn't give due credit to MeCab, in my opinion, which is a much more extensive project. In terms of naming, what about using "statistical" instead of "mecab" for this class of analyzers? I'm thinking "Viterbi" could be good to refer to in shared tokenizer code. This said, I think it could be a good idea to refer to "mecab" in the dictionary compiler code, documentation, etc. to make sure users understand that we can read this model format. Any thoughts? > Combine Nori and Kuromoji DictionaryBuilder > --- > > Key: LUCENE-8817 > URL: https://issues.apache.org/jira/browse/LUCENE-8817 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Namgyu Kim >Priority: Major > > This issue is related to LUCENE-8816. > Currently Nori and Kuromoji Analyzer use the same dictionary structure. > (MeCab) > If we combine the DictionaryBuilders, we can reduce the code size. > But this task may have a dependency on the language. > (like HEADER string in BinaryDictionary and CharacterDefinition, methods in > BinaryDictionaryWriter, ...) > On the other hand, there are many overlapping classes. > The purpose of this patch is to provide users of Nori and Kuromoji with the > same system dictionary generator. > It may take some time because there is some work involved. > The work will be based on the latest master, and if LUCENE-8816 is > finished first, I will pull the latest code and proceed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8816) Decouple Kuromoji's morphological analyser and its dictionary
[ https://issues.apache.org/jira/browse/LUCENE-8816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16852551#comment-16852551 ] Christian Moen commented on LUCENE-8816: Separating out the dictionaries is a great idea. [~rcmuir] made great efforts making the original dictionary tiny and some assumptions were made based on the value ranges of the original source data. To me it sounds like a good idea to keep the Japanese and Korean dictionaries separate initially and consider combining them later on when the implications of such a combination are clear. I agree with [~jim.ferenczi]. > Decouple Kuromoji's morphological analyser and its dictionary > - > > Key: LUCENE-8816 > URL: https://issues.apache.org/jira/browse/LUCENE-8816 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/analysis >Reporter: Tomoko Uchida >Priority: Major > > I was inspired by this mailing-list thread. > > [http://mail-archives.apache.org/mod_mbox/lucene-java-user/201905.mbox/%3CCAGUSZHA3U_vWpRfxQb4jttT7sAOu%2BuaU8MfvXSYgNP9s9JNsXw%40mail.gmail.com%3E] > As many Japanese users already know, the default built-in dictionary bundled with > Kuromoji (MeCab IPADIC) is a bit old and has not been maintained for many years. > While it has slowly become obsolete, well-maintained and/or extended > dictionaries have risen up in recent years (e.g. > [mecab-ipadic-neologd|https://github.com/neologd/mecab-ipadic-neologd], > [UniDic|https://unidic.ninjal.ac.jp/]). To use them with Kuromoji, some > attempts/projects/efforts have been made in Japan. > However, the current architecture - a dictionary bundled into the jar - is essentially > incompatible with the idea of switching the system dictionary, and developers > have difficulty doing so. > Traditionally, the morphological analysis engine (viterbi logic) and the > encoded dictionary (language model) had been decoupled (like MeCab, the > origin of Kuromoji, or lucene-gosen). So actually decoupling them is a > natural idea, and I feel that it's a good time to re-think the current > architecture. > Also this would be good for advanced users who have customized/re-trained > their own system dictionary. > Goals of this issue: > * Decouple JapaneseTokenizer itself and encoded system dictionary. > * Implement dynamic dictionary load mechanism. > * Provide developer-oriented dictionary build tool. > Non-goals: > * Provide a learner or language model (this is up to users and should be outside > the scope). > I have not dived into the code yet, so I have no idea whether it's easy or > difficult at this moment. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8752) Apply a patch to kuromoji dictionary to properly handle Japanese new era '令和' (REIWA)
[ https://issues.apache.org/jira/browse/LUCENE-8752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16814467#comment-16814467 ] Christian Moen commented on LUCENE-8752: Thanks a lot, [~Tomoko Uchida]. > Apply a patch to kuromoji dictionary to properly handle Japanese new era '令和' > (REIWA) > - > > Key: LUCENE-8752 > URL: https://issues.apache.org/jira/browse/LUCENE-8752 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/analysis >Reporter: Tomoko Uchida >Assignee: Tomoko Uchida >Priority: Minor > > As of May 1st, 2019, Japanese era '元号' (Gengo) will be set to '令和' (Reiwa). > See this article for more details: > [https://www.bbc.com/news/world-asia-47769566] > Currently '令和' is splitted up to '令' and '和' by {{JapaneseTokenizer}}. It > should be tokenized as one word so that Japanese texts including era names > are searched as users expect. Because the default Kuromoji dictionary > (mecab-ipadic) has not been maintained since 2007, a one-line patch to the > source CSV file is needed for this era change. > Era name is used in many official or formal documents in Japan, so it would > be desirable the search systems properly handle this without adding a user > dictionary or using phrase query. :) > FYI, JDK DateTime API will support the new era (in the next updates.) > [https://blogs.oracle.com/java-platform-group/a-new-japanese-era-for-java] > The patch is available here: > [https://github.com/apache/lucene-solr/pull/632] > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
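Until a dictionary containing the new era name ships, one interim workaround is a user dictionary entry, even though the issue description notes it would be preferable not to need one. Below is a minimal sketch assuming a Lucene version where {{UserDictionary.open(Reader)}} is available (see LUCENE-6468 further down); the reading and the カスタム名詞 part-of-speech label are illustrative, not taken from mecab-ipadic.

{code:java}
import java.io.StringReader;

import org.apache.lucene.analysis.ja.JapaneseTokenizer;
import org.apache.lucene.analysis.ja.JapaneseTokenizer.Mode;
import org.apache.lucene.analysis.ja.dict.UserDictionary;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class ReiwaUserDictionaryExample {
  public static void main(String[] args) throws Exception {
    // One CSV entry: surface form, segmentation, readings, part-of-speech.
    UserDictionary userDict = UserDictionary.open(new StringReader("令和,令和,レイワ,カスタム名詞\n"));

    // Normal mode, discarding punctuation, with the user dictionary applied.
    JapaneseTokenizer tokenizer = new JapaneseTokenizer(userDict, true, Mode.NORMAL);
    tokenizer.setReader(new StringReader("令和元年"));
    CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
    tokenizer.reset();
    while (tokenizer.incrementToken()) {
      // With the entry above, 令和 should be emitted as a single token instead of 令 + 和.
      System.out.println(term.toString());
    }
    tokenizer.end();
    tokenizer.close();
  }
}
{code}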
[jira] [Commented] (LUCENE-8752) Apply a patch to kuromoji dictionary to properly handle Japanese new era '令和' (REIWA)
[ https://issues.apache.org/jira/browse/LUCENE-8752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16811412#comment-16811412 ] Christian Moen commented on LUCENE-8752: Thanks for this, [~Tomoko Uchida]. I think it's a good idea to make this change. I'll follow up early next week. > Apply a patch to kuromoji dictionary to properly handle Japanese new era '令和' > (REIWA) > - > > Key: LUCENE-8752 > URL: https://issues.apache.org/jira/browse/LUCENE-8752 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/analysis >Reporter: Tomoko Uchida >Priority: Minor > > As of May 1st, 2019, Japanese era '元号' (Gengo) will be set to '令和' (Reiwa). > See this article for more details: > [https://www.bbc.com/news/world-asia-47769566] > Currently '令和' is splitted up to '令' and '和' by {{JapaneseTokenizer}}. It > should be tokenized as one word so that Japanese texts including era names > are searched as users expect. Because the default Kuromoji dictionary > (mecab-ipadic) has not been maintained since 2007, a one-line patch to the > source CSV file is needed for this era change. > Era name is used in many official or formal documents in Japan, so it would > be desirable the search systems properly handle this without adding a user > dictionary or using phrase query. :) > FYI, JDK DateTime API will support the new era (in the next updates.) > [https://blogs.oracle.com/java-platform-group/a-new-japanese-era-for-java] > The patch is available here: > [https://github.com/apache/lucene-solr/pull/632] > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Assigned] (LUCENE-7992) Kuromoji fails with UnsupportedOperationException in case of duplicate keys in the user dictionary
[ https://issues.apache.org/jira/browse/LUCENE-7992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christian Moen reassigned LUCENE-7992: -- Assignee: Christian Moen > Kuromoji fails with UnsupportedOperationException in case of duplicate keys > in the user dictionary > -- > > Key: LUCENE-7992 > URL: https://issues.apache.org/jira/browse/LUCENE-7992 > Project: Lucene - Core > Issue Type: Bug >Reporter: Adrien Grand >Assignee: Christian Moen >Priority: Minor > > Failing is the right thing to do but the exception could clarify the source > of the problem. Today it just throws an UnsupportedOperationException with no > error message because of a call to PositiveIntOutputs.merge. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-7992) Kuromoji fails with UnsupportedOperationException in case of duplicate keys in the user dictionary
[ https://issues.apache.org/jira/browse/LUCENE-7992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16202098#comment-16202098 ] Christian Moen commented on LUCENE-7992: Thanks, Adrien. I'll have a look. > Kuromoji fails with UnsupportedOperationException in case of duplicate keys > in the user dictionary > -- > > Key: LUCENE-7992 > URL: https://issues.apache.org/jira/browse/LUCENE-7992 > Project: Lucene - Core > Issue Type: Bug >Reporter: Adrien Grand >Priority: Minor > > Failing is the right thing to do but the exception could clarify the source > of the problem. Today it just throws an UnsupportedOperationException with no > error message because of a call to PositiveIntOutputs.merge. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
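To make the failure easier to see, the sketch below constructs a user dictionary with two entries sharing the same surface form, which is the situation that ends up in {{PositiveIntOutputs.merge}}. The up-front duplicate check is purely illustrative of the kind of clearer error message the issue asks for; it is not the fix that was committed.

{code:java}
import java.io.StringReader;
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.analysis.ja.dict.UserDictionary;

public class DuplicateUserDictionaryEntry {
  public static void main(String[] args) throws Exception {
    // Two entries with the same surface form trigger the problem described above.
    String entries =
        "関西国際空港,関西 国際 空港,カンサイ コクサイ クウコウ,カスタム名詞\n" +
        "関西国際空港,関西国際空港,カンサイコクサイクウコウ,カスタム名詞\n";

    // Illustrative pre-check that names the offending key before the FST is built,
    // instead of failing later with a message-less UnsupportedOperationException.
    Set<String> seen = new HashSet<>();
    for (String line : entries.split("\n")) {
      String surface = line.split(",")[0];
      if (!seen.add(surface)) {
        throw new IllegalArgumentException("Duplicate user dictionary surface form: " + surface);
      }
    }

    UserDictionary.open(new StringReader(entries)); // not reached in this sketch
  }
}
{code}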
[jira] [Assigned] (LUCENE-7181) JapaneseTokenizer: Validate segmentation of User Dictionary entries on creation
[ https://issues.apache.org/jira/browse/LUCENE-7181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christian Moen reassigned LUCENE-7181: -- Assignee: Christian Moen > JapaneseTokenizer: Validate segmentation of User Dictionary entries on > creation > --- > > Key: LUCENE-7181 > URL: https://issues.apache.org/jira/browse/LUCENE-7181 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Tomás Fernández Löbbe >Assignee: Christian Moen > Attachments: LUCENE-7181.patch > > > From the [conversation on the dev > list|http://mail-archives.apache.org/mod_mbox/lucene-dev/201604.mbox/%3CCAMJgJxR8gLnXi7WXkN3KFfxHu=posevxxarbbg+chce1tzh...@mail.gmail.com%3E] > The user dictionary in the {{JapaneseTokenizer}} allows users to customize > how a stream is broken into tokens using a specific set of rules provided > like: > AABBBCC -> AA BBB CC > It does not allow users to change any of the token characters like: > (1) AABBBCC -> DD BBB CC (this will just tokenize to "AA", "BBB", "CC", > seems to only care about positions) > It also doesn't let a character be part of more than one token, like: > (2) AABBBCC -> AAB BBB BCC (this will throw an AIOOBE) > ..or make the output token bigger than the input text: > (3) AA -> AAA (Also AIOOBE) > Currently there is no validation for those cases, case 1 doesn't fail but > provide unexpected tokens. Cases 2 and 3 fail when the input text is > analyzed. We should add validation to the {{UserDictionary}} creation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
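A sketch of the kind of validation described above is shown below: the segments of an entry must concatenate back to exactly the surface form, which rules out cases (1), (2) and (3). The helper is illustrative only and is not the code from the attached patch.

{code:java}
public class UserDictionaryEntryValidator {

  /**
   * Segments are separated by spaces in the user dictionary format, e.g. "AABBBCC" -> "AA BBB CC".
   * Rejects entries whose segments do not concatenate back to the surface form.
   */
  static void validate(String surface, String segmentation) {
    String concatenated = segmentation.replace(" ", "");
    if (!concatenated.equals(surface)) {
      throw new IllegalArgumentException(
          "Illegal segmentation [" + segmentation + "] for surface form [" + surface + "]: "
              + "segments must concatenate back to the surface form");
    }
  }

  public static void main(String[] args) {
    validate("AABBBCC", "AA BBB CC");   // valid re-grouping, passes
    validate("AABBBCC", "AAB BBB BCC"); // case (2) above: rejected here instead of an AIOOBE at analysis time
  }
}
{code}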
[jira] [Commented] (LUCENE-6837) Add N-best output capability to JapaneseTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15093096#comment-15093096 ] Christian Moen commented on LUCENE-6837: Hello Mike, Yes, I'd like to backport this to 5.5. > Add N-best output capability to JapaneseTokenizer > - > > Key: LUCENE-6837 > URL: https://issues.apache.org/jira/browse/LUCENE-6837 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/analysis >Affects Versions: 5.3 >Reporter: KONNO, Hiroharu >Assignee: Christian Moen >Priority: Minor > Attachments: LUCENE-6837 for 5.4.zip, LUCENE-6837.patch, > LUCENE-6837.patch, LUCENE-6837.patch, LUCENE-6837.patch, LUCENE-6837.patch > > > Japanese morphological analyzers often generate mis-segmented tokens. N-best > output reduces the impact of mis-segmentation on search result. N-best output > is more meaningful than character N-gram, and it increases hit count too. > If you use N-best output, you can get decompounded tokens (ex: > "シニアソフトウェアエンジニア" => {"シニア", "シニアソフトウェアエンジニア", "ソフトウェア", "エンジニア"}) and > overwrapped tokens (ex: "数学部長谷川" => {"数学", "部", "部長", "長谷川", "谷川"}), > depending on the dictionary and N-best parameter settings. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-6837) Add N-best output capability to JapaneseTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15029762#comment-15029762 ] Christian Moen commented on LUCENE-6837: Thanks a lot, Konno-san. Things look good. My apologies that I couldn't look into this earlier. I've attached a new patch where I've included your fix and also renamed some methods. I think it's getting ready... > Add N-best output capability to JapaneseTokenizer > - > > Key: LUCENE-6837 > URL: https://issues.apache.org/jira/browse/LUCENE-6837 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/analysis >Affects Versions: 5.3 >Reporter: KONNO, Hiroharu >Assignee: Christian Moen >Priority: Minor > Attachments: LUCENE-6837.patch, LUCENE-6837.patch, LUCENE-6837.patch, > LUCENE-6837.patch, LUCENE-6837.patch > > > Japanese morphological analyzers often generate mis-segmented tokens. N-best > output reduces the impact of mis-segmentation on search result. N-best output > is more meaningful than character N-gram, and it increases hit count too. > If you use N-best output, you can get decompounded tokens (ex: > "シニアソフトウェアエンジニア" => {"シニア", "シニアソフトウェアエンジニア", "ソフトウェア", "エンジニア"}) and > overwrapped tokens (ex: "数学部長谷川" => {"数学", "部", "部長", "長谷川", "谷川"}), > depending on the dictionary and N-best parameter settings. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-6837) Add N-best output capability to JapaneseTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christian Moen updated LUCENE-6837: --- Attachment: LUCENE-6837.patch > Add N-best output capability to JapaneseTokenizer > - > > Key: LUCENE-6837 > URL: https://issues.apache.org/jira/browse/LUCENE-6837 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/analysis >Affects Versions: 5.3 >Reporter: KONNO, Hiroharu >Assignee: Christian Moen >Priority: Minor > Attachments: LUCENE-6837.patch, LUCENE-6837.patch, LUCENE-6837.patch, > LUCENE-6837.patch, LUCENE-6837.patch > > > Japanese morphological analyzers often generate mis-segmented tokens. N-best > output reduces the impact of mis-segmentation on search result. N-best output > is more meaningful than character N-gram, and it increases hit count too. > If you use N-best output, you can get decompounded tokens (ex: > "シニアソフトウェアエンジニア" => {"シニア", "シニアソフトウェアエンジニア", "ソフトウェア", "エンジニア"}) and > overwrapped tokens (ex: "数学部長谷川" => {"数学", "部", "部長", "長谷川", "谷川"}), > depending on the dictionary and N-best parameter settings. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-6837) Add N-best output capability to JapaneseTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15010673#comment-15010673 ] Christian Moen commented on LUCENE-6837: Tokenizing Japanese Wikipedia seems fine with nBestCost set, but it seems like random-blasting doesn't pass. Konno-san, I'm wondering if I can trouble you to look into why {{testRandomHugeStrings}} fails with the latest patch? The test basically does random-blasting with nBestCost set to 2000. I think it's a good idea to fix this before we commit. I believe it's easily reproducible, but I used {noformat} ant test -Dtestcase=TestJapaneseTokenizer -Dtests.method=testRandomHugeStrings -Dtests.seed=99EB179B92E66345 -Dtests.slow=true -Dtests.locale=sr_CS -Dtests.timezone=PNT -Dtests.asserts=true -Dtests.file.encoding=US-ASCII {noformat} in my environment. > Add N-best output capability to JapaneseTokenizer > - > > Key: LUCENE-6837 > URL: https://issues.apache.org/jira/browse/LUCENE-6837 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/analysis >Affects Versions: 5.3 >Reporter: KONNO, Hiroharu >Assignee: Christian Moen >Priority: Minor > Attachments: LUCENE-6837.patch, LUCENE-6837.patch, LUCENE-6837.patch > > > Japanese morphological analyzers often generate mis-segmented tokens. N-best > output reduces the impact of mis-segmentation on search result. N-best output > is more meaningful than character N-gram, and it increases hit count too. > If you use N-best output, you can get decompounded tokens (ex: > "シニアソフトウェアエンジニア" => {"シニア", "シニアソフトウェアエンジニア", "ソフトウェア", "エンジニア"}) and > overwrapped tokens (ex: "数学部長谷川" => {"数学", "部", "部長", "長谷川", "谷川"}), > depending on the dictionary and N-best parameter settings. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-6837) Add N-best output capability to JapaneseTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christian Moen updated LUCENE-6837: --- Attachment: LUCENE-6837.patch > Add N-best output capability to JapaneseTokenizer > - > > Key: LUCENE-6837 > URL: https://issues.apache.org/jira/browse/LUCENE-6837 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/analysis >Affects Versions: 5.3 >Reporter: KONNO, Hiroharu >Assignee: Christian Moen >Priority: Minor > Attachments: LUCENE-6837.patch, LUCENE-6837.patch, LUCENE-6837.patch > > > Japanese morphological analyzers often generate mis-segmented tokens. N-best > output reduces the impact of mis-segmentation on search result. N-best output > is more meaningful than character N-gram, and it increases hit count too. > If you use N-best output, you can get decompounded tokens (ex: > "シニアソフトウェアエンジニア" => {"シニア", "シニアソフトウェアエンジニア", "ソフトウェア", "エンジニア"}) and > overwrapped tokens (ex: "数学部長谷川" => {"数学", "部", "部長", "長谷川", "谷川"}), > depending on the dictionary and N-best parameter settings. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-6837) Add N-best output capability to JapaneseTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14995604#comment-14995604 ] Christian Moen commented on LUCENE-6837: I've attached a new patch with some minor changes: * Made the {{System.out.printf}} calls subject to VERBOSE being true * Introduced a RuntimeException to deal with the initialization error cases * Renamed the new parameters to {{nBestCost}} and {{nBestExamples}} * Added additional javadoc here and there to document the new functionality I'm planning on running some stability tests with the new tokenizer parameters next. > Add N-best output capability to JapaneseTokenizer > - > > Key: LUCENE-6837 > URL: https://issues.apache.org/jira/browse/LUCENE-6837 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/analysis >Affects Versions: 5.3 >Reporter: KONNO, Hiroharu >Assignee: Christian Moen >Priority: Minor > Attachments: LUCENE-6837.patch, LUCENE-6837.patch > > > Japanese morphological analyzers often generate mis-segmented tokens. N-best > output reduces the impact of mis-segmentation on search result. N-best output > is more meaningful than character N-gram, and it increases hit count too. > If you use N-best output, you can get decompounded tokens (ex: > "シニアソフトウェアエンジニア" => {"シニア", "シニアソフトウェアエンジニア", "ソフトウェア", "エンジニア"}) and > overwrapped tokens (ex: "数学部長谷川" => {"数学", "部", "部長", "長谷川", "谷川"}), > depending on the dictionary and N-best parameter settings. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
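A minimal usage sketch of the renamed parameter is shown below, assuming the {{setNBestCost}} setter introduced by this patch; the cost value 2000 mirrors the one used in the random-blasting test mentioned above.

{code:java}
import java.io.StringReader;

import org.apache.lucene.analysis.ja.JapaneseTokenizer;
import org.apache.lucene.analysis.ja.JapaneseTokenizer.Mode;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class NBestExample {
  public static void main(String[] args) throws Exception {
    // No user dictionary, discard punctuation, normal mode.
    JapaneseTokenizer tokenizer = new JapaneseTokenizer(null, true, Mode.NORMAL);
    // nBestCost widens the accepted cost band around the best Viterbi path, so
    // near-best segmentations are emitted as additional (stacked) tokens.
    tokenizer.setNBestCost(2000);

    tokenizer.setReader(new StringReader("シニアソフトウェアエンジニア"));
    CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
    tokenizer.reset();
    while (tokenizer.incrementToken()) {
      // With n-best enabled, decompounded tokens such as シニア, ソフトウェア and
      // エンジニア can appear in addition to the full compound.
      System.out.println(term.toString());
    }
    tokenizer.end();
    tokenizer.close();
  }
}
{code}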
[jira] [Updated] (LUCENE-6837) Add N-best output capability to JapaneseTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christian Moen updated LUCENE-6837: --- Attachment: LUCENE-6837.patch > Add N-best output capability to JapaneseTokenizer > - > > Key: LUCENE-6837 > URL: https://issues.apache.org/jira/browse/LUCENE-6837 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/analysis >Affects Versions: 5.3 >Reporter: KONNO, Hiroharu >Assignee: Christian Moen >Priority: Minor > Attachments: LUCENE-6837.patch, LUCENE-6837.patch > > > Japanese morphological analyzers often generate mis-segmented tokens. N-best > output reduces the impact of mis-segmentation on search result. N-best output > is more meaningful than character N-gram, and it increases hit count too. > If you use N-best output, you can get decompounded tokens (ex: > "シニアソフトウェアエンジニア" => {"シニア", "シニアソフトウェアエンジニア", "ソフトウェア", "エンジニア"}) and > overwrapped tokens (ex: "数学部長谷川" => {"数学", "部", "部長", "長谷川", "谷川"}), > depending on the dictionary and N-best parameter settings. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Assigned] (LUCENE-6837) Add N-best output capability to JapaneseTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christian Moen reassigned LUCENE-6837: -- Assignee: Christian Moen > Add N-best output capability to JapaneseTokenizer > - > > Key: LUCENE-6837 > URL: https://issues.apache.org/jira/browse/LUCENE-6837 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/analysis >Affects Versions: 5.3 >Reporter: KONNO, Hiroharu >Assignee: Christian Moen >Priority: Minor > Attachments: LUCENE-6837.patch > > > Japanese morphological analyzers often generate mis-segmented tokens. N-best > output reduces the impact of mis-segmentation on search result. N-best output > is more meaningful than character N-gram, and it increases hit count too. > If you use N-best output, you can get decompounded tokens (ex: > "シニアソフトウェアエンジニア" => {"シニア", "シニアソフトウェアエンジニア", "ソフトウェア", "エンジニア"}) and > overwrapped tokens (ex: "数学部長谷川" => {"数学", "部", "部長", "長谷川", "谷川"}), > depending on the dictionary and N-best parameter settings. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-6837) Add N-best output capability to JapaneseTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978156#comment-14978156 ] Christian Moen commented on LUCENE-6837: Thanks a lot for this, Konno-san. Very nice work! I like the idea of calculating the n-best cost using examples. Since search mode and extended mode solve a similar problem, I'm wondering if it makes sense to introduce n-best as a separate mode in itself. In your experience developing the feature, do you think it makes a lot of sense to use it with search and extended mode? I think I'm in favour of supporting it for all the modes, even though it perhaps makes the most sense for normal mode. The reason for this is to make sure that the entire API for {{JapaneseTokenizer}} is functional for all the tokenizer modes. I'll add a few tests and I'd like to commit this soon. > Add N-best output capability to JapaneseTokenizer > - > > Key: LUCENE-6837 > URL: https://issues.apache.org/jira/browse/LUCENE-6837 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/analysis >Affects Versions: 5.3 >Reporter: KONNO, Hiroharu >Priority: Minor > Attachments: LUCENE-6837.patch > > > Japanese morphological analyzers often generate mis-segmented tokens. N-best > output reduces the impact of mis-segmentation on search result. N-best output > is more meaningful than character N-gram, and it increases hit count too. > If you use N-best output, you can get decompounded tokens (ex: > "シニアソフトウェアエンジニア" => {"シニア", "シニアソフトウェアエンジニア", "ソフトウェア", "エンジニア"}) and > overwrapped tokens (ex: "数学部長谷川" => {"数学", "部", "部長", "長谷川", "谷川"}), > depending on the dictionary and N-best parameter settings. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-6837) Add N-best output capability to JapaneseTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14954487#comment-14954487 ] Christian Moen commented on LUCENE-6837: Thanks. I've had a very quick look at the code and have some comments and questions. I'm happy to take care of this, Koji. > Add N-best output capability to JapaneseTokenizer > - > > Key: LUCENE-6837 > URL: https://issues.apache.org/jira/browse/LUCENE-6837 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/analysis >Affects Versions: 5.3 >Reporter: KONNO, Hiroharu >Priority: Minor > Attachments: LUCENE-6837.patch > > > Japanese morphological analyzers often generate mis-segmented tokens. N-best > output reduces the impact of mis-segmentation on search result. N-best output > is more meaningful than character N-gram, and it increases hit count too. > If you use N-best output, you can get decompounded tokens (ex: > "シニアソフトウェアエンジニア" => {"シニア", "シニアソフトウェアエンジニア", "ソフトウェア", "エンジニア"}) and > overwrapped tokens (ex: "数学部長谷川" => {"数学", "部", "部長", "長谷川", "谷川"}), > depending on the dictionary and N-best parameter settings. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-6733) Incorrect URL causes build break - analysis/kuromoji
[ https://issues.apache.org/jira/browse/LUCENE-6733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692824#comment-14692824 ] Christian Moen commented on LUCENE-6733: Thanks, I'll have a look. > Incorrect URL causes build break - analysis/kuromoji > > > Key: LUCENE-6733 > URL: https://issues.apache.org/jira/browse/LUCENE-6733 > Project: Lucene - Core > Issue Type: Bug > Components: general/build, modules/analysis >Affects Versions: 5.2.1 > Environment: n/a >Reporter: Susumu Fukuda >Priority: Minor > Attachments: LUCENE-6733.patch > > Original Estimate: 1h > Remaining Estimate: 1h > > Ivy.xml contains the dictionary URLs for both IPADIC and NAIST-JDIC. > But they are already gone, so it causes a build break in the > download-dict task. > Google Code will be closed soon, and SourceForge (.jp, not .net) has moved to > osdn.jp. > Fumm… not sure how I can attach a patch file… I can't find a field. Later? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-6468) Empty kuromoji user dictionary -> NPE
[ https://issues.apache.org/jira/browse/LUCENE-6468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christian Moen resolved LUCENE-6468. Resolution: Fixed Fix Version/s: 5.x Trunk > Empty kuromoji user dictionary -> NPE > - > > Key: LUCENE-6468 > URL: https://issues.apache.org/jira/browse/LUCENE-6468 > Project: Lucene - Core > Issue Type: Bug >Reporter: Robert Muir >Assignee: Christian Moen > Fix For: Trunk, 5.x > > Attachments: LUCENE-6468.patch > > > Kuromoji user dictionary takes Reader and allows for comments and other lines > to be ignored. But if its "empty" in the sense of no actual entries, the > returned FST will be null, and it will throw a confusing NPE. > JapaneseTokenizer and JapaneseAnalyzer apis already treat null UserDictionary > as having none at all, so I think the best fix is to fix the UserDictionary > api from UserDictionary(Reader) to UserDictionary.open(Reader) or similar, > and return null if the FST is empty. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-6468) Empty kuromoji user dictionary -> NPE
[ https://issues.apache.org/jira/browse/LUCENE-6468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14537767#comment-14537767 ] Christian Moen commented on LUCENE-6468: Thanks, Ohtani-san! I added a {{final}} required by {{branch_5x}} for JDK 1.7 and also changed the empty user dictionary test to contain a user dictionary with a comment and some newlines (it's still empty, though). I've committed your patch to {{trunk}} and {{branch_5x}}. > Empty kuromoji user dictionary -> NPE > - > > Key: LUCENE-6468 > URL: https://issues.apache.org/jira/browse/LUCENE-6468 > Project: Lucene - Core > Issue Type: Bug >Reporter: Robert Muir >Assignee: Christian Moen > Attachments: LUCENE-6468.patch > > > Kuromoji user dictionary takes Reader and allows for comments and other lines > to be ignored. But if its "empty" in the sense of no actual entries, the > returned FST will be null, and it will throw a confusing NPE. > JapaneseTokenizer and JapaneseAnalyzer apis already treat null UserDictionary > as having none at all, so I think the best fix is to fix the UserDictionary > api from UserDictionary(Reader) to UserDictionary.open(Reader) or similar, > and return null if the FST is empty. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
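The behavior referred to above can be illustrated as follows: a user dictionary reader containing only a comment and some newlines has no entries, so {{UserDictionary.open}} returns null, and a null {{UserDictionary}} is already treated as "no user dictionary" by {{JapaneseTokenizer}} and {{JapaneseAnalyzer}}. A minimal sketch:

{code:java}
import java.io.StringReader;

import org.apache.lucene.analysis.ja.dict.UserDictionary;

public class EmptyUserDictionaryExample {
  public static void main(String[] args) throws Exception {
    // Only a comment and blank lines -- no actual entries, mirroring the test described above.
    String content = "# custom segmentation entries go here\n\n\n";

    // With the fix, open() returns null for an "empty" dictionary rather than
    // producing a dictionary with a null FST that later fails with a confusing NPE.
    UserDictionary userDict = UserDictionary.open(new StringReader(content));
    System.out.println(userDict == null); // expected: true
  }
}
{code}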
[jira] [Assigned] (LUCENE-6468) Empty kuromoji user dictionary -> NPE
[ https://issues.apache.org/jira/browse/LUCENE-6468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christian Moen reassigned LUCENE-6468: -- Assignee: Christian Moen > Empty kuromoji user dictionary -> NPE > - > > Key: LUCENE-6468 > URL: https://issues.apache.org/jira/browse/LUCENE-6468 > Project: Lucene - Core > Issue Type: Bug >Reporter: Robert Muir >Assignee: Christian Moen > Attachments: LUCENE-6468.patch > > > Kuromoji user dictionary takes Reader and allows for comments and other lines > to be ignored. But if its "empty" in the sense of no actual entries, the > returned FST will be null, and it will throw a confusing NPE. > JapaneseTokenizer and JapaneseAnalyzer apis already treat null UserDictionary > as having none at all, so I think the best fix is to fix the UserDictionary > api from UserDictionary(Reader) to UserDictionary.open(Reader) or similar, > and return null if the FST is empty. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-6468) Empty kuromoji user dictionary -> NPE
[ https://issues.apache.org/jira/browse/LUCENE-6468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14532226#comment-14532226 ] Christian Moen commented on LUCENE-6468: Good catch. I can look into a patch for this. > Empty kuromoji user dictionary -> NPE > - > > Key: LUCENE-6468 > URL: https://issues.apache.org/jira/browse/LUCENE-6468 > Project: Lucene - Core > Issue Type: Bug >Reporter: Robert Muir > > Kuromoji user dictionary takes Reader and allows for comments and other lines > to be ignored. But if its "empty" in the sense of no actual entries, the > returned FST will be null, and it will throw a confusing NPE. > JapaneseTokenizer and JapaneseAnalyzer apis already treat null UserDictionary > as having none at all, so I think the best fix is to fix the UserDictionary > api from UserDictionary(Reader) to UserDictionary.open(Reader) or similar, > and return null if the FST is empty. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-6216) Make it easier to modify Japanese token attributes downstream
[ https://issues.apache.org/jira/browse/LUCENE-6216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304342#comment-14304342 ] Christian Moen commented on LUCENE-6216: Thanks, Robert. I had the same idea and I tried this out last night. The advantage of the approach is that we only read the buffer data for the token attributes we use, but it leaves the API slightly awkward in my opinion since we would have both a {{setToken()}} and a {{setPartOfSpeech()}}. That said, this is still perhaps the best way to go for performance reasons, and these APIs are very low-level and not commonly used. For the sake of exploring an alternative idea: a different approach could be to have separate token filters set these attributes. The tokenizer would set a {{CharTermAttribute}}, etc. and a {{JapaneseTokenAttribute}} (or something suitably named) that holds the {{Token}}. A separate {{JapanesePartOfSpeechFilter}} would be responsible for setting the {{PartOfSpeechAttribute}} by getting the data from the {{JapaneseTokenAttribute}} using a {{getToken()}} method. We'd still need logic similar to the above to deal with {{setPartOfSpeech()}}, etc. so I don't think we gain anything by taking this approach, and it's a big change, too. > Make it easier to modify Japanese token attributes downstream > - > > Key: LUCENE-6216 > URL: https://issues.apache.org/jira/browse/LUCENE-6216 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/analysis >Reporter: Christian Moen >Priority: Minor > > Japanese-specific token attributes such as {{PartOfSpeechAttribute}}, > {{BaseFormAttribute}}, etc. get their values from a > {{org.apache.lucene.analysis.ja.Token}} through a {{setToken()}} method. > This makes it cumbersome to change these token attributes later on in the > analysis chain since the {{Token}} instances are difficult to instantiate > (sort of read-only objects). > I've run into this issue in LUCENE-3922 (JapaneseNumberFilter) where it would > be appropriate to update token attributes to also reflect Japanese number > normalization. > I think it might be more practical to allow setting a specific value for > these token attributes directly rather than through a {{Token}} since it > makes the APIs simpler, makes it easier to change attributes downstream, and > also makes supporting additional dictionaries easier. > The drawback of the approach that I can think of is a performance hit, as we > will miss out on the inherent lazy retrieval of these token attributes from > the {{Token}} object (and the underlying dictionary/buffer). > I'd like to do some testing to better understand the performance impact of > this change. Happy to hear your thoughts on this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
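To make the trade-off concrete, the two API shapes being discussed could be sketched roughly as below. Neither interface is committed Lucene API; the names are hypothetical and only illustrate lazy dictionary-backed reads versus directly settable values.

{code:java}
import org.apache.lucene.analysis.ja.Token;

// (a) Current style: the attribute lazily reads its value from a dictionary-backed Token,
// which is cheap but hard for downstream filters (e.g. JapaneseNumberFilter) to override.
interface LazyPartOfSpeechAttribute {
  void setToken(Token token);      // value resolved on demand from the Token/dictionary buffer
  String getPartOfSpeech();
}

// (b) Alternative style: downstream filters can overwrite the value directly after
// composing or rewriting tokens, at the cost of losing the lazy retrieval.
interface MutablePartOfSpeechAttribute {
  void setPartOfSpeech(String partOfSpeech);
  String getPartOfSpeech();
}
{code}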
[jira] [Created] (LUCENE-6216) Make it easier to modify Japanese token attributes downstream
Christian Moen created LUCENE-6216: -- Summary: Make it easier to modify Japanese token attributes downstream Key: LUCENE-6216 URL: https://issues.apache.org/jira/browse/LUCENE-6216 Project: Lucene - Core Issue Type: Improvement Components: modules/analysis Reporter: Christian Moen Priority: Minor Japanese-specific token attributes such as {{PartOfSpeechAttribute}}, {{BaseFormAttribute}}, etc. get their values from a {{org.apache.lucene.analysis.ja.Token}} through a {{setToken()}} method. This makes it cumbersome to change these token attributes later on in the analysis chain since the {{Token}} instances are difficult to instantiate (sort of read-only objects). I've run into this issue in LUCENE-3922 (JapaneseNumberFilter) where it would be appropriate to update token attributes to also reflect Japanese number normalization. I think it might be more practical to allow setting a specific value for these token attributes directly rather than through a {{Token}} since it makes the APIs simpler, makes it easier to change attributes downstream, and also makes supporting additional dictionaries easier. The drawback of the approach that I can think of is a performance hit, as we will miss out on the inherent lazy retrieval of these token attributes from the {{Token}} object (and the underlying dictionary/buffer). I'd like to do some testing to better understand the performance impact of this change. Happy to hear your thoughts on this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji
[ https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christian Moen updated LUCENE-3922: --- Attachment: LUCENE-3922.patch Minor updates to javadoc. I'll leave reading attributes, etc. unchanged for now and get back to resolving this once we have better mechanisms in place for updating some of the Japanese token attributes downstream. > Add Japanese Kanji number normalization to Kuromoji > --- > > Key: LUCENE-3922 > URL: https://issues.apache.org/jira/browse/LUCENE-3922 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/analysis >Affects Versions: 4.0-ALPHA >Reporter: Kazuaki Hiraga >Assignee: Christian Moen > Labels: features > Fix For: 5.1 > > Attachments: LUCENE-3922.patch, LUCENE-3922.patch, LUCENE-3922.patch, > LUCENE-3922.patch, LUCENE-3922.patch, LUCENE-3922.patch, LUCENE-3922.patch, > LUCENE-3922.patch > > > Japanese people use Kanji numerals instead of Arabic numerals for writing > price, address and so on. i.e 12万4800円(124,800JPY), 二番町三ノ二(3-2 Nibancho) and > 十二月(December). So, we would like to normalize those Kanji numerals to Arabic > numerals (I don't think we need to have a capability to normalize to Kanji > numerals). > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji
[ https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christian Moen updated LUCENE-3922: --- Attachment: LUCENE-3922.patch Updated patch with decimal number support and additional javadoc; the test code now makes precommit happy. Token attributes such as part-of-speech, readings, etc. for the normalized token are currently inherited from the last token used when composing the normalized number. Since these values are likely to be wrong, I'm inclined to set these attributes to null or a reasonable default. I'm very happy to hear your thoughts on this. > Add Japanese Kanji number normalization to Kuromoji > --- > > Key: LUCENE-3922 > URL: https://issues.apache.org/jira/browse/LUCENE-3922 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/analysis >Affects Versions: 4.0-ALPHA >Reporter: Kazuaki Hiraga >Assignee: Christian Moen > Labels: features > Fix For: 5.1 > > Attachments: LUCENE-3922.patch, LUCENE-3922.patch, LUCENE-3922.patch, > LUCENE-3922.patch, LUCENE-3922.patch, LUCENE-3922.patch, LUCENE-3922.patch > > > Japanese people use Kanji numerals instead of Arabic numerals for writing > price, address and so on. i.e 12万4800円(124,800JPY), 二番町三ノ二(3-2 Nibancho) and > 十二月(December). So, we would like to normalize those Kanji numerals to Arabic > numerals (I don't think we need to have a capability to normalize to Kanji > numerals). > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji
[ https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christian Moen updated LUCENE-3922: --- Fix Version/s: 5.1 > Add Japanese Kanji number normalization to Kuromoji > --- > > Key: LUCENE-3922 > URL: https://issues.apache.org/jira/browse/LUCENE-3922 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/analysis >Affects Versions: 4.0-ALPHA >Reporter: Kazuaki Hiraga >Assignee: Christian Moen > Labels: features > Fix For: 5.1 > > Attachments: LUCENE-3922.patch, LUCENE-3922.patch, LUCENE-3922.patch, > LUCENE-3922.patch, LUCENE-3922.patch, LUCENE-3922.patch > > > Japanese people use Kanji numerals instead of Arabic numerals for writing > price, address and so on. i.e 12万4800円(124,800JPY), 二番町三ノ二(3-2 Nibancho) and > 十二月(December). So, we would like to normalize those Kanji numerals to Arabic > numerals (I don't think we need to have a capability to normalize to Kanji > numerals). > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji
[ https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christian Moen updated LUCENE-3922: --- Attachment: LUCENE-3922.patch New patch with CHANGES.txt and services entry. Will do some end-to-end testing next. > Add Japanese Kanji number normalization to Kuromoji > --- > > Key: LUCENE-3922 > URL: https://issues.apache.org/jira/browse/LUCENE-3922 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/analysis >Affects Versions: 4.0-ALPHA >Reporter: Kazuaki Hiraga >Assignee: Christian Moen > Labels: features > Attachments: LUCENE-3922.patch, LUCENE-3922.patch, LUCENE-3922.patch, > LUCENE-3922.patch, LUCENE-3922.patch, LUCENE-3922.patch > > > Japanese people use Kanji numerals instead of Arabic numerals for writing > price, address and so on. i.e 12万4800円(124,800JPY), 二番町三ノ二(3-2 Nibancho) and > 十二月(December). So, we would like to normalize those Kanji numerals to Arabic > numerals (I don't think we need to have a capability to normalize to Kanji > numerals). > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji
[ https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296379#comment-14296379 ] Christian Moen commented on LUCENE-3922: Please feel free to test it. Feedback is very welcome. The patch is against {{trunk}} and this should make it into 5.1. > Add Japanese Kanji number normalization to Kuromoji > --- > > Key: LUCENE-3922 > URL: https://issues.apache.org/jira/browse/LUCENE-3922 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/analysis >Affects Versions: 4.0-ALPHA >Reporter: Kazuaki Hiraga >Assignee: Christian Moen > Labels: features > Attachments: LUCENE-3922.patch, LUCENE-3922.patch, LUCENE-3922.patch, > LUCENE-3922.patch, LUCENE-3922.patch > > > Japanese people use Kanji numerals instead of Arabic numerals for writing > price, address and so on. i.e 12万4800円(124,800JPY), 二番町三ノ二(3-2 Nibancho) and > 十二月(December). So, we would like to normalize those Kanji numerals to Arabic > numerals (I don't think we need to have a capability to normalize to Kanji > numerals). > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji
[ https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christian Moen updated LUCENE-3922: --- Attachment: LUCENE-3922.patch Added factory and wrote javadoc. > Add Japanese Kanji number normalization to Kuromoji > --- > > Key: LUCENE-3922 > URL: https://issues.apache.org/jira/browse/LUCENE-3922 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/analysis >Affects Versions: 4.0-ALPHA >Reporter: Kazuaki Hiraga >Assignee: Christian Moen > Labels: features > Attachments: LUCENE-3922.patch, LUCENE-3922.patch, LUCENE-3922.patch, > LUCENE-3922.patch, LUCENE-3922.patch > > > Japanese people use Kanji numerals instead of Arabic numerals for writing > price, address and so on. i.e 12万4800円(124,800JPY), 二番町三ノ二(3-2 Nibancho) and > 十二月(December). So, we would like to normalize those Kanji numerals to Arabic > numerals (I don't think we need to have a capability to normalize to Kanji > numerals). > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji
[ https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173567#comment-14173567 ] Christian Moen commented on LUCENE-3922: Gaute and I have done testing on real-world data and we've uncovered and fixed a couple of corner-case issues. Our todo items are as follows: # Do additional testing and possibly add additional number formats # Document some unsupported cases in unit tests # Add class-level javadoc # Add a Solr factory > Add Japanese Kanji number normalization to Kuromoji > --- > > Key: LUCENE-3922 > URL: https://issues.apache.org/jira/browse/LUCENE-3922 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/analysis >Affects Versions: 4.0-ALPHA >Reporter: Kazuaki Hiraga >Assignee: Christian Moen > Labels: features > Attachments: LUCENE-3922.patch, LUCENE-3922.patch, LUCENE-3922.patch, > LUCENE-3922.patch > > > Japanese people use Kanji numerals instead of Arabic numerals for writing > price, address and so on. i.e 12万4800円(124,800JPY), 二番町三ノ二(3-2 Nibancho) and > 十二月(December). So, we would like to normalize those Kanji numerals to Arabic > numerals (I don't think we need to have a capability to normalize to Kanji > numerals). > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji
[ https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christian Moen updated LUCENE-3922: --- Attachment: LUCENE-3922.patch > Add Japanese Kanji number normalization to Kuromoji > --- > > Key: LUCENE-3922 > URL: https://issues.apache.org/jira/browse/LUCENE-3922 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/analysis >Affects Versions: 4.0-ALPHA >Reporter: Kazuaki Hiraga >Assignee: Christian Moen > Labels: features > Attachments: LUCENE-3922.patch, LUCENE-3922.patch, LUCENE-3922.patch, > LUCENE-3922.patch > > > Japanese people use Kanji numerals instead of Arabic numerals for writing > price, address and so on. i.e 12万4800円(124,800JPY), 二番町三ノ二(3-2 Nibancho) and > 十二月(December). So, we would like to normalize those Kanji numerals to Arabic > numerals (I don't think we need to have a capability to normalize to Kanji > numerals). > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji
[ https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14164954#comment-14164954 ] Christian Moen commented on LUCENE-3922: I've attached a new patch. The {{checkRandomData}} issues were caused by improper handling of token composition for graphs (bug found by [~gaute]). Tokens preceded by a position-increment-zero token are left untouched, and so are stacked/synonym tokens. We'll do some more testing and add some documentation before we move forward and commit this. > Add Japanese Kanji number normalization to Kuromoji > --- > > Key: LUCENE-3922 > URL: https://issues.apache.org/jira/browse/LUCENE-3922 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/analysis >Affects Versions: 4.0-ALPHA >Reporter: Kazuaki Hiraga >Assignee: Christian Moen > Labels: features > Attachments: LUCENE-3922.patch, LUCENE-3922.patch, LUCENE-3922.patch > > > Japanese people use Kanji numerals instead of Arabic numerals for writing > price, address and so on. i.e 12万4800円(124,800JPY), 二番町三ノ二(3-2 Nibancho) and > 十二月(December). So, we would like to normalize those Kanji numerals to Arabic > numerals (I don't think we need to have a capability to normalize to Kanji > numerals). > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji
[ https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christian Moen updated LUCENE-3922: --- Attachment: LUCENE-3922.patch > Add Japanese Kanji number normalization to Kuromoji > --- > > Key: LUCENE-3922 > URL: https://issues.apache.org/jira/browse/LUCENE-3922 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/analysis >Affects Versions: 4.0-ALPHA >Reporter: Kazuaki Hiraga >Assignee: Christian Moen > Labels: features > Attachments: LUCENE-3922.patch, LUCENE-3922.patch, LUCENE-3922.patch > > > Japanese people use Kanji numerals instead of Arabic numerals for writing > price, address and so on. i.e 12万4800円(124,800JPY), 二番町三ノ二(3-2 Nibancho) and > 十二月(December). So, we would like to normalize those Kanji numerals to Arabic > numerals (I don't think we need to have a capability to normalize to Kanji > numerals). > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji
[ https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14085909#comment-14085909 ] Christian Moen commented on LUCENE-3922: Gaute and I have been doing some work on this and we have rewritten it as a {{TokenFilter}}. A few comments: * We have added support for numbers such as 3.2兆円 as you requested, Kazu. * We could potentially use a POS-tag attribute from Kuromoji to identify the numbers that we are composing, but perhaps not relying on POS-tags makes this filter also useful in the case of n-gramming. * We haven't implemented any of the anchoring logic discussed above, i.e. whether to restrict normalization to prices, etc. Is this useful to have? * Input such as {{1,5}} becomes {{15}} after normalization, which could be undesired. Is this bad input or do we want anchoring to retain these numbers? One thing, though: in order to support some of this number parsing, i.e. cases such as 3.2兆円, we need to use Kuromoji in a mode that retains punctuation characters. There's also an unresolved issue found by {{checkRandomData}} that we haven't tracked down and fixed yet. This is a work in progress and feedback is welcome. > Add Japanese Kanji number normalization to Kuromoji > --- > > Key: LUCENE-3922 > URL: https://issues.apache.org/jira/browse/LUCENE-3922 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/analysis >Affects Versions: 4.0-ALPHA >Reporter: Kazuaki Hiraga > Labels: features > Attachments: LUCENE-3922.patch, LUCENE-3922.patch > > > Japanese people use Kanji numerals instead of Arabic numerals for writing > price, address and so on. i.e 12万4800円(124,800JPY), 二番町三ノ二(3-2 Nibancho) and > 十二月(December). So, we would like to normalize those Kanji numerals to Arabic > numerals (I don't think we need to have a capability to normalize to Kanji > numerals). > -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
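A minimal sketch of the analysis chain implied above: the tokenizer is created with discardPunctuation set to false so the decimal point in 3.2兆円 reaches the filter, which then composes the number tokens. The filter class name follows this patch; the exact tokens printed may vary by dictionary and filter version.

{code:java}
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ja.JapaneseNumberFilter;
import org.apache.lucene.analysis.ja.JapaneseTokenizer;
import org.apache.lucene.analysis.ja.JapaneseTokenizer.Mode;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class NumberNormalizationExample {
  public static void main(String[] args) throws Exception {
    // discardPunctuation = false so the decimal point survives tokenization.
    JapaneseTokenizer tokenizer = new JapaneseTokenizer(null, false, Mode.NORMAL);
    TokenStream stream = new JapaneseNumberFilter(tokenizer);

    tokenizer.setReader(new StringReader("3.2兆円"));
    CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
    stream.reset();
    while (stream.incrementToken()) {
      // Expected along the lines of "3200000000000" followed by "円".
      System.out.println(term.toString());
    }
    stream.end();
    stream.close();
  }
}
{code}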
[jira] [Updated] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji
[ https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christian Moen updated LUCENE-3922: --- Attachment: LUCENE-3922.patch > Add Japanese Kanji number normalization to Kuromoji > --- > > Key: LUCENE-3922 > URL: https://issues.apache.org/jira/browse/LUCENE-3922 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/analysis >Affects Versions: 4.0-ALPHA >Reporter: Kazuaki Hiraga > Labels: features > Attachments: LUCENE-3922.patch, LUCENE-3922.patch > > > Japanese people use Kanji numerals instead of Arabic numerals for writing > price, address and so on. i.e 12万4800円(124,800JPY), 二番町三ノ二(3-2 Nibancho) and > 十二月(December). So, we would like to normalize those Kanji numerals to Arabic > numerals (I don't think we need to have a capability to normalize to Kanji > numerals). > -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13901180#comment-13901180 ] Christian Moen commented on SOLR-1301: -- I've been reading through (pretty much all) the comments on this JIRA and I'd like to thank you all for the great effort you have put into this. > Add a Solr contrib that allows for building Solr indexes via Hadoop's > Map-Reduce. > - > > Key: SOLR-1301 > URL: https://issues.apache.org/jira/browse/SOLR-1301 > Project: Solr > Issue Type: New Feature >Reporter: Andrzej Bialecki >Assignee: Mark Miller > Fix For: 5.0, 4.7 > > Attachments: README.txt, SOLR-1301-hadoop-0-20.patch, > SOLR-1301-hadoop-0-20.patch, SOLR-1301-maven-intellij.patch, SOLR-1301.patch, > SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, > SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, > SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, > SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, > SOLR-1301.patch, SolrRecordWriter.java, commons-logging-1.0.4.jar, > commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, > hadoop-0.20.1-core.jar, hadoop-core-0.20.2-cdh3u3.jar, hadoop.patch, > log4j-1.2.15.jar > > > This patch contains a contrib module that provides distributed indexing > (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is > twofold: > * provide an API that is familiar to Hadoop developers, i.e. that of > OutputFormat > * avoid unnecessary export and (de)serialization of data maintained on HDFS. > SolrOutputFormat consumes data produced by reduce tasks directly, without > storing it in intermediate files. Furthermore, by using an > EmbeddedSolrServer, the indexing task is split into as many parts as there > are reducers, and the data to be indexed is not sent over the network. > Design > -- > Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, > which in turn uses SolrRecordWriter to write this data. SolrRecordWriter > instantiates an EmbeddedSolrServer, and it also instantiates an > implementation of SolrDocumentConverter, which is responsible for turning > Hadoop (key, value) into a SolrInputDocument. This data is then added to a > batch, which is periodically submitted to EmbeddedSolrServer. When reduce > task completes, and the OutputFormat is closed, SolrRecordWriter calls > commit() and optimize() on the EmbeddedSolrServer. > The API provides facilities to specify an arbitrary existing solr.home > directory, from which the conf/ and lib/ files will be taken. > This process results in the creation of as many partial Solr home directories > as there were reduce tasks. The output shards are placed in the output > directory on the default filesystem (e.g. HDFS). Such part-N directories > can be used to run N shard servers. Additionally, users can specify the > number of reduce tasks, in particular 1 reduce task, in which case the output > will consist of a single shard. > An example application is provided that processes large CSV files and uses > this API. It uses a custom CSV processing to avoid (de)serialization overhead. > This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this > issue, you should put it in contrib/hadoop/lib. > Note: the development of this patch was sponsored by an anonymous contributor > and approved for release under Apache License. 
-- This message was sent by Atlassian JIRA (v6.1.5#6160) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
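As a rough illustration of the converter concept described in the SOLR-1301 design above, here is a hypothetical sketch of turning a Hadoop (key, value) pair into a {{SolrInputDocument}}. The class and method names are illustrative only and do not reflect the actual API in the attached patches.

{code:java}
import java.util.Collection;
import java.util.Collections;

import org.apache.hadoop.io.Text;
import org.apache.solr.common.SolrInputDocument;

// Hypothetical converter: maps one CSV record (key = record id, value = "title,body")
// to a SolrInputDocument that the record writer can hand to its EmbeddedSolrServer.
public class CsvLineToSolrDocument {
  public Collection<SolrInputDocument> convert(Text key, Text value) {
    String[] fields = value.toString().split(",", 2);
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", key.toString());
    doc.addField("title", fields[0]);
    doc.addField("body", fields.length > 1 ? fields[1] : "");
    return Collections.singletonList(doc);
  }
}
{code}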
[jira] [Commented] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module
[ https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13820093#comment-13820093 ] Christian Moen commented on LUCENE-2899: bq. Stuff like this NER should NOT be in the analysis chain. as i said, its more useful in the "document build" phase anyway. +1 Benson, as far as I understand, ES doesn't have the concept by design. > Add OpenNLP Analysis capabilities as a module > - > > Key: LUCENE-2899 > URL: https://issues.apache.org/jira/browse/LUCENE-2899 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/analysis >Reporter: Grant Ingersoll >Assignee: Grant Ingersoll >Priority: Minor > Fix For: 4.6 > > Attachments: LUCENE-2899-RJN.patch, LUCENE-2899.patch, > OpenNLPFilter.java, OpenNLPTokenizer.java > > > Now that OpenNLP is an ASF project and has a nice license, it would be nice > to have a submodule (under analysis) that exposed capabilities for it. Drew > Farris, Tom Morton and I have code that does: > * Sentence Detection as a Tokenizer (could also be a TokenFilter, although it > would have to change slightly to buffer tokens) > * NamedEntity recognition as a TokenFilter > We are also planning a Tokenizer/TokenFilter that can put parts of speech as > either payloads (PartOfSpeechAttribute?) on a token or at the same position. > I'd propose it go under: > modules/analysis/opennlp -- This message was sent by Atlassian JIRA (v6.1#6144) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries
[ https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13796545#comment-13796545 ] Christian Moen commented on LUCENE-4956: SooMyung, I've committed the latest changes we merged in Seoul on Monday. It's great if you can fix the decompounding issue we came across, which we disabled a test for. Uwe, +1 to use {{Class#getResourceAsStream}} and remove {{FileUtils}} and {{JarResources}}. I'll make these changes and commit to the branch. Overall, I think there are a lot of things we can do to improve this code. I'd very much like to hear your opinion on what we should fix before committing to trunk, getting this onto the 4.x branch and improving from there. My thinking is that it might be good to get this committed so we'll have Korean working even though the code needs some work. SooMyung has a community in Korea that uses it, and it's serving their needs as far as I understand. Happy to hear people's opinion on this. > the korean analyzer that has a korean morphological analyzer and dictionaries > - > > Key: LUCENE-4956 > URL: https://issues.apache.org/jira/browse/LUCENE-4956 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/analysis >Affects Versions: 4.2 >Reporter: SooMyung Lee >Assignee: Christian Moen > Labels: newbie > Attachments: kr.analyzer.4x.tar, lucene-4956.patch, lucene4956.patch, > LUCENE-4956.patch > > > Korean language has specific characteristic. When developing search service > with lucene & solr in korean, there are some problems in searching and > indexing. The korean analyer solved the problems with a korean morphological > anlyzer. It consists of a korean morphological analyzer, dictionaries, a > korean tokenizer and a korean filter. The korean anlyzer is made for lucene > and solr. If you develop a search service with lucene in korean, It is the > best idea to choose the korean analyzer. -- This message was sent by Atlassian JIRA (v6.1#6144) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
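To make the {{Class#getResourceAsStream}} suggestion concrete, here is a minimal, hypothetical loader sketch (names are illustrative, not the arirang API) that reads a dictionary file packaged inside the analyzer jar without going through {{FileUtils}} or {{JarResources}}:

{code:java}
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical loader: resolves the dictionary as a classpath resource relative to the
// given class, so the same code works whether the data sits in a jar or on disk.
public final class DictionaryResourceLoader {
  public static List<String> readLines(Class<?> clazz, String resourceName) throws IOException {
    InputStream in = clazz.getResourceAsStream(resourceName);
    if (in == null) {
      throw new IOException("Resource not found: " + resourceName);
    }
    List<String> lines = new ArrayList<>();
    try (BufferedReader reader =
             new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        lines.add(line);
      }
    }
    return lines;
  }
}
{code}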
[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries
[ https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13794046#comment-13794046 ] Christian Moen commented on LUCENE-4956: Soomyung and myself met up in Seoul today and we've merged his latest locally. I'll commit the changes to this branch when I'm back in Tokyo and Soomyung will follow up with fixing a known issue afterwards. Hopefully we can commit this to trunk very soon. > the korean analyzer that has a korean morphological analyzer and dictionaries > - > > Key: LUCENE-4956 > URL: https://issues.apache.org/jira/browse/LUCENE-4956 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/analysis >Affects Versions: 4.2 >Reporter: SooMyung Lee >Assignee: Christian Moen > Labels: newbie > Attachments: kr.analyzer.4x.tar, lucene-4956.patch, lucene4956.patch, > LUCENE-4956.patch > > > Korean language has specific characteristic. When developing search service > with lucene & solr in korean, there are some problems in searching and > indexing. The korean analyer solved the problems with a korean morphological > anlyzer. It consists of a korean morphological analyzer, dictionaries, a > korean tokenizer and a korean filter. The korean anlyzer is made for lucene > and solr. If you develop a search service with lucene in korean, It is the > best idea to choose the korean analyzer. -- This message was sent by Atlassian JIRA (v6.1#6144) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries
[ https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13789944#comment-13789944 ] Christian Moen commented on LUCENE-4956: Thanks a lot. > the korean analyzer that has a korean morphological analyzer and dictionaries > - > > Key: LUCENE-4956 > URL: https://issues.apache.org/jira/browse/LUCENE-4956 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/analysis >Affects Versions: 4.2 >Reporter: SooMyung Lee >Assignee: Christian Moen > Labels: newbie > Attachments: kr.analyzer.4x.tar, lucene-4956.patch, lucene4956.patch, > LUCENE-4956.patch > > > Korean language has specific characteristic. When developing search service > with lucene & solr in korean, there are some problems in searching and > indexing. The korean analyer solved the problems with a korean morphological > anlyzer. It consists of a korean morphological analyzer, dictionaries, a > korean tokenizer and a korean filter. The korean anlyzer is made for lucene > and solr. If you develop a search service with lucene in korean, It is the > best idea to choose the korean analyzer. -- This message was sent by Atlassian JIRA (v6.1#6144) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries
[ https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13789052#comment-13789052 ] Christian Moen commented on LUCENE-4956: SooMyung, The patch you uploaded on September 11th, was that made against the latest {{lucene4956}} branch? The patch doesn't apply properly against {{lucene4956}} for me. Could you clarify its origin and instruct me how it can be applied? If you can make a patch against the code on {{lucene4956}}, that would be much appreciated. Thanks! > the korean analyzer that has a korean morphological analyzer and dictionaries > - > > Key: LUCENE-4956 > URL: https://issues.apache.org/jira/browse/LUCENE-4956 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/analysis >Affects Versions: 4.2 >Reporter: SooMyung Lee >Assignee: Christian Moen > Labels: newbie > Attachments: kr.analyzer.4x.tar, lucene-4956.patch, lucene4956.patch, > LUCENE-4956.patch > > > Korean language has specific characteristic. When developing search service > with lucene & solr in korean, there are some problems in searching and > indexing. The korean analyer solved the problems with a korean morphological > anlyzer. It consists of a korean morphological analyzer, dictionaries, a > korean tokenizer and a korean filter. The korean anlyzer is made for lucene > and solr. If you develop a search service with lucene in korean, It is the > best idea to choose the korean analyzer. -- This message was sent by Atlassian JIRA (v6.1#6144) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries
[ https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13786905#comment-13786905 ] Christian Moen commented on LUCENE-4956: Thanks for pushing me on this. I'll have a look at your recent changes and commit to trunk shortly if everything seems fine. I hope to have this committed to trunk early next week. Sorry for this having dragged out. > the korean analyzer that has a korean morphological analyzer and dictionaries > - > > Key: LUCENE-4956 > URL: https://issues.apache.org/jira/browse/LUCENE-4956 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/analysis >Affects Versions: 4.2 >Reporter: SooMyung Lee >Assignee: Christian Moen > Labels: newbie > Attachments: kr.analyzer.4x.tar, lucene-4956.patch, lucene4956.patch, > LUCENE-4956.patch > > > Korean language has specific characteristic. When developing search service > with lucene & solr in korean, there are some problems in searching and > indexing. The korean analyer solved the problems with a korean morphological > anlyzer. It consists of a korean morphological analyzer, dictionaries, a > korean tokenizer and a korean filter. The korean anlyzer is made for lucene > and solr. If you develop a search service with lucene in korean, It is the > best idea to choose the korean analyzer. -- This message was sent by Atlassian JIRA (v6.1#6144) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-5244) NPE in Japanese Analyzer
[ https://issues.apache.org/jira/browse/LUCENE-5244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13777560#comment-13777560 ] Christian Moen commented on LUCENE-5244: Hello Benson, In your code on Github, try calling {{tokenStream.reset()}} before consumption. > NPE in Japanese Analyzer > > > Key: LUCENE-5244 > URL: https://issues.apache.org/jira/browse/LUCENE-5244 > Project: Lucene - Core > Issue Type: Bug > Components: modules/analysis >Affects Versions: 4.4 >Reporter: Benson Margulies > > I've got a test case that shows an NPE with the Japanese analyzer. > It's all available in https://github.com/benson-basis/kuromoji-npe, and I > explicitly grant a license to the Foundation. > If anyone would prefer that I attach a tarball here, just let me know. > {noformat} > --- > T E S T S > --- > Running com.basistech.testcase.JapaneseNpeTest > Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.298 sec <<< > FAILURE! - in com.basistech.testcase.JapaneseNpeTest > japaneseNpe(com.basistech.testcase.JapaneseNpeTest) Time elapsed: 0.282 sec > <<< ERROR! > java.lang.NullPointerException: null > at > org.apache.lucene.analysis.util.RollingCharBuffer.get(RollingCharBuffer.java:86) > at > org.apache.lucene.analysis.ja.JapaneseTokenizer.parse(JapaneseTokenizer.java:618) > at > org.apache.lucene.analysis.ja.JapaneseTokenizer.incrementToken(JapaneseTokenizer.java:468) > at > com.basistech.testcase.JapaneseNpeTest.japaneseNpe(JapaneseNpeTest.java:28) > {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
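For reference, the standard {{TokenStream}} consumption contract looks like the sketch below; omitting {{reset()}} leaves the tokenizer's internal state (such as the {{RollingCharBuffer}} in the stack trace above) uninitialized, which is what surfaces as the NPE.

{code:java}
import java.io.IOException;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public final class TokenStreamConsumer {
  // Minimal consumption loop following the TokenStream workflow.
  public static void consume(TokenStream ts) throws IOException {
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();                       // mandatory before the first incrementToken()
    while (ts.incrementToken()) {
      System.out.println(term.toString());
    }
    ts.end();                         // records final offset state
    ts.close();                       // releases resources
  }
}
{code}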
[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries
[ https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13739392#comment-13739392 ] Christian Moen commented on LUCENE-4956: SooMyung, let's sync up regarding your latest changes (the patch you attached). I'm thinking perhaps we can merge to {{trunk}} first and iterate from there. Thanks. > the korean analyzer that has a korean morphological analyzer and dictionaries > - > > Key: LUCENE-4956 > URL: https://issues.apache.org/jira/browse/LUCENE-4956 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/analysis >Affects Versions: 4.2 >Reporter: SooMyung Lee >Assignee: Christian Moen > Labels: newbie > Attachments: kr.analyzer.4x.tar, lucene4956.patch, LUCENE-4956.patch > > > Korean language has specific characteristic. When developing search service > with lucene & solr in korean, there are some problems in searching and > indexing. The korean analyer solved the problems with a korean morphological > anlyzer. It consists of a korean morphological analyzer, dictionaries, a > korean tokenizer and a korean filter. The korean anlyzer is made for lucene > and solr. If you develop a search service with lucene in korean, It is the > best idea to choose the korean analyzer. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries
[ https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13739387#comment-13739387 ] Christian Moen commented on LUCENE-4956: Attaching a patch against {{trunk}} (r1513348). > the korean analyzer that has a korean morphological analyzer and dictionaries > - > > Key: LUCENE-4956 > URL: https://issues.apache.org/jira/browse/LUCENE-4956 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/analysis >Affects Versions: 4.2 >Reporter: SooMyung Lee >Assignee: Christian Moen > Labels: newbie > Attachments: kr.analyzer.4x.tar, lucene4956.patch, LUCENE-4956.patch > > > Korean language has specific characteristic. When developing search service > with lucene & solr in korean, there are some problems in searching and > indexing. The korean analyer solved the problems with a korean morphological > anlyzer. It consists of a korean morphological analyzer, dictionaries, a > korean tokenizer and a korean filter. The korean anlyzer is made for lucene > and solr. If you develop a search service with lucene in korean, It is the > best idea to choose the korean analyzer. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries
[ https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christian Moen updated LUCENE-4956: --- Attachment: LUCENE-4956.patch > the korean analyzer that has a korean morphological analyzer and dictionaries > - > > Key: LUCENE-4956 > URL: https://issues.apache.org/jira/browse/LUCENE-4956 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/analysis >Affects Versions: 4.2 >Reporter: SooMyung Lee >Assignee: Christian Moen > Labels: newbie > Attachments: kr.analyzer.4x.tar, lucene4956.patch, LUCENE-4956.patch > > > Korean language has specific characteristic. When developing search service > with lucene & solr in korean, there are some problems in searching and > indexing. The korean analyer solved the problems with a korean morphological > anlyzer. It consists of a korean morphological analyzer, dictionaries, a > korean tokenizer and a korean filter. The korean anlyzer is made for lucene > and solr. If you develop a search service with lucene in korean, It is the > best idea to choose the korean analyzer. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries
[ https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13739301#comment-13739301 ] Christian Moen commented on LUCENE-4956: I've now aligned the branch with {{trunk}}, updated the example {{schema.xml}} to use {{text_ko}} naming for the Korean field type. I've also indexed Korean Wikipedia continuously for a few hours and the JVM heap looks fine. There are several additional things that can be done with this code, including generating the parser using JFlex at build time, fixing some of the position issues with random-blasting, cleanups and dead-code removal, etc. This said, I believe the code we have is useful to Korean users as-is and I'm thinking it's a good idea to integrate it into {{trunk}} and iterate further from there. Please share your thoughts. Thanks. > the korean analyzer that has a korean morphological analyzer and dictionaries > - > > Key: LUCENE-4956 > URL: https://issues.apache.org/jira/browse/LUCENE-4956 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/analysis >Affects Versions: 4.2 >Reporter: SooMyung Lee >Assignee: Christian Moen > Labels: newbie > Attachments: kr.analyzer.4x.tar, lucene4956.patch > > > Korean language has specific characteristic. When developing search service > with lucene & solr in korean, there are some problems in searching and > indexing. The korean analyer solved the problems with a korean morphological > anlyzer. It consists of a korean morphological analyzer, dictionaries, a > korean tokenizer and a korean filter. The korean anlyzer is made for lucene > and solr. If you develop a search service with lucene in korean, It is the > best idea to choose the korean analyzer. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries
[ https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13704147#comment-13704147 ] Christian Moen commented on LUCENE-4956: Hello SooMyung, I'm the one who hasn't followed up properly on this, as I've been too bogged down with other things. I've set aside time next week to work on this and I hope to have Korean merged and integrated with {{trunk}} then. I'm not sure we can make 4.4, but I'm willing to put in extra effort if there's a chance we can get it in in time. > the korean analyzer that has a korean morphological analyzer and dictionaries > - > > Key: LUCENE-4956 > URL: https://issues.apache.org/jira/browse/LUCENE-4956 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/analysis >Affects Versions: 4.2 >Reporter: SooMyung Lee >Assignee: Christian Moen > Labels: newbie > Attachments: kr.analyzer.4x.tar, lucene4956.patch > > > Korean language has specific characteristic. When developing search service > with lucene & solr in korean, there are some problems in searching and > indexing. The korean analyer solved the problems with a korean morphological > anlyzer. It consists of a korean morphological analyzer, dictionaries, a > korean tokenizer and a korean filter. The korean anlyzer is made for lucene > and solr. If you develop a search service with lucene in korean, It is the > best idea to choose the korean analyzer. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-4945) Japanese Autocomplete and Highlighter broken
[ https://issues.apache.org/jira/browse/SOLR-4945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13694148#comment-13694148 ] Christian Moen commented on SOLR-4945: -- No, it's not. {{JapaneseTokenizerFactory}} is available in 3.6 or newer. Kindly upgrade to the latest version of Solr (currently 4.3.1) and see if the problem persists. If it does, please indicate how you reproduced it in detail so we can start investigating the cause. Thanks. > Japanese Autocomplete and Highlighter broken > > > Key: SOLR-4945 > URL: https://issues.apache.org/jira/browse/SOLR-4945 > Project: Solr > Issue Type: Bug > Components: highlighter >Reporter: Shruthi Khatawkar > > Autocomplete is implemented with Highlighter functionality. This works fine > for most of the languages but breaks for Japanese. > multivalued,termVector,termPositions and termOffset are set to true. > Here is an example: > Query: product classic. > Result: > Actual : > この商品の互換性の機種にproduct 1 やclassic Touch2 が記載が有りません。 USB接続ケーブルをproduct 1 やclassic > Touch2に付属の物を使えば利用出来ると思いますが 間違っていますか? > With Highlighter ( tags being used): > この商品の互換性の機種にproduct 1 やclassic Touch2 が記載が有りません。 > USB接続ケーブルをproduct 1 やclassic Touch2に付属の物を使えば利用出来ると思いますが 間違っていますか? > Though query terms "product classic" is repeated twice, highlighting is > happening only on the first instance. As shown above. > Solr returns only first instance offset and second instance is ignored. > Also it's observed, highlighter repeats first letter of the token if there is > numeric. > For eg.Query : product and We have product1, highlighter returns as > pproduct1. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-4945) Japanese Autocomplete and Highlighter broken
[ https://issues.apache.org/jira/browse/SOLR-4945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13693894#comment-13693894 ] Christian Moen commented on SOLR-4945: -- Hello Shruthi, Could you confirm if you see this problem when using {{JapaneseTokenizerFactory}}? {{SenTokenizerFactory}} isn't part of Solr and if you are seeing funny offsets there, that could be the root cause of this. This is my speculation only -- I really don't know... I believe {{JapaneseTokenizerFactory}} in normal mode gives a similar segmentation to {{SenTokenizer}} and it would be good to see if we can reproduce this using {{JapaneseTokenizerFactory}}. Many thanks. > Japanese Autocomplete and Highlighter broken > > > Key: SOLR-4945 > URL: https://issues.apache.org/jira/browse/SOLR-4945 > Project: Solr > Issue Type: Bug > Components: highlighter >Reporter: Shruthi Khatawkar > > Autocomplete is implemented with Highlighter functionality. This works fine > for most of the languages but breaks for Japanese. > multivalued,termVector,termPositions and termOffset are set to true. > Here is an example: > Query: product classic. > Result: > Actual : > この商品の互換性の機種にproduct 1 やclassic Touch2 が記載が有りません。 USB接続ケーブルをproduct 1 やclassic > Touch2に付属の物を使えば利用出来ると思いますが 間違っていますか? > With Highlighter ( tags being used): > この商品の互換性の機種にproduct 1 やclassic Touch2 が記載が有りません。 > USB接続ケーブルをproduct 1 やclassic Touch2に付属の物を使えば利用出来ると思いますが 間違っていますか? > Though query terms "product classic" is repeated twice, highlighting is > happening only on the first instance. As shown above. > Solr returns only first instance offset and second instance is ignored. > Also it's observed, highlighter repeats first letter of the token if there is > numeric. > For eg.Query : product and We have product1, highlighter returns as > pproduct1. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
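One way to inspect the offsets the Lucene-side tokenizer produces, independently of Solr and the highlighter, is a small sketch like the one below. It assumes a recent Lucene where the tokenizer input is set via {{setReader}}; the sample text is taken from the issue description, and the printed start/end offsets can be compared with what the highlighter actually marks up.

{code:java}
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.ja.JapaneseTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

// Prints each token with its character offsets in normal (non-search) mode.
public class PrintJapaneseOffsets {
  public static void main(String[] args) throws IOException {
    String text = "この商品の互換性の機種にproduct 1 やclassic Touch2 が記載が有りません。";
    try (JapaneseTokenizer tokenizer =
             new JapaneseTokenizer(null, true, JapaneseTokenizer.Mode.NORMAL)) {
      tokenizer.setReader(new StringReader(text));
      CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
      OffsetAttribute offset = tokenizer.addAttribute(OffsetAttribute.class);
      tokenizer.reset();
      while (tokenizer.incrementToken()) {
        System.out.println(term + " [" + offset.startOffset() + "," + offset.endOffset() + "]");
      }
      tokenizer.end();
    }
  }
}
{code}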
[jira] [Commented] (SOLR-4945) Japanese Autocomplete and Highlighter broken
[ https://issues.apache.org/jira/browse/SOLR-4945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13690131#comment-13690131 ] Christian Moen commented on SOLR-4945: -- Hello Shruthi, Does this have anything to do with autocomplete or is this solely a highlighting issue? Which field type are you using? Are you using JapaneseTokenizer as part of this field type with search mode turned on? Thanks. > Japanese Autocomplete and Highlighter broken > > > Key: SOLR-4945 > URL: https://issues.apache.org/jira/browse/SOLR-4945 > Project: Solr > Issue Type: Bug > Components: highlighter >Reporter: Shruthi Khatawkar > > Autocomplete is implemented with Highlighter functionality. This works fine > for most of the languages but breaks for Japanese. > multivalued,termVector,termPositions and termOffset are set to true. > Here is an example: > Query: product classic. > Result: > Actual : > この商品の互換性の機種にproduct 1 やclassic Touch2 が記載が有りません。 USB接続ケーブルをproduct 1 やclassic > Touch2に付属の物を使えば利用出来ると思いますが 間違っていますか? > With Highlighter ( tags being used): > この商品の互換性の機種にproduct 1 やclassic Touch2 が記載が有りません。 > USB接続ケーブルをproduct 1 やclassic Touch2に付属の物を使えば利用出来ると思いますが 間違っていますか? > Though query terms "product classic" is repeated twice, highlighting is > happening only on the first instance. As shown above. > Solr returns only first instance offset and second instance is ignored. > Also it's observed, highlighter repeats first letter of the token if there is > numeric. > For eg.Query : product and We have product1, highlighter returns as > pproduct1. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-5013) ScandinavianInterintelligableASCIIFoldingFilter
[ https://issues.apache.org/jira/browse/LUCENE-5013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13664383#comment-13664383 ] Christian Moen commented on LUCENE-5013: bq. Though maybe the name could be simpler, (ScandinavianNormalizationFilter?) +1 > ScandinavianInterintelligableASCIIFoldingFilter > --- > > Key: LUCENE-5013 > URL: https://issues.apache.org/jira/browse/LUCENE-5013 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/analysis >Affects Versions: 4.3 >Reporter: Karl Wettin >Priority: Trivial > Attachments: LUCENE-5013.txt > > > This filter is an augmentation of output from ASCIIFoldingFilter, > it discriminate against double vowels aa, ae, ao, oe and oo, leaving just the > first one. > blåbærsyltetøj == blåbärsyltetöj == blaabaarsyltetoej == blabarsyltetoj > räksmörgås == ræksmørgås == ræksmörgaos == raeksmoergaas == raksmorgas > Caveats: > Since this is a filtering on top of ASCIIFoldingFilter äöåøæ already has been > folded down to aoaoae when handled by this filter it will cause effects such > as: > bøen -> boen -> bon > åene -> aene -> ane > I find this to be a trivial problem compared to not finding anything at all. > Background: > Swedish åäö is in fact the same letters as Norwegian and Danish åæø and thus > interchangeable in when used between these languages. They are however folded > differently when people type them on a keyboard lacking these characters and > ASCIIFoldingFilter handle ä and æ differently. > When a Swedish person is lacking umlauted characters on the keyboard they > consistently type a, a, o instead of å, ä, ö. Foreigners also tend to use a, > a, o. > In Norway people tend to type aa, ae and oe instead of å, æ and ø. Some use > a, a, o. I've also seen oo, ao, etc. And permutations. Not sure about Denmark > but the pattern is probably the same. > This filter solves that problem, but might also cause new. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
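The filter was eventually committed under a name along these lines; a minimal usage sketch, assuming the {{ScandinavianNormalizationFilter}} that now lives in the miscellaneous analysis package and a recent Lucene {{Analyzer}} API:

{code:java}
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.miscellaneous.ScandinavianNormalizationFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

// Sketch: normalizes the interchangeable Scandinavian characters (å/ä/ö/æ/ø) and the
// aa/ae/ao/oe/oo digraphs toward a single form, so that the spelling variants shown in
// the issue description match each other at search time.
public class ScandinavianNormalizationExampleAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer tokenizer = new StandardTokenizer();
    TokenStream stream = new ScandinavianNormalizationFilter(tokenizer);
    return new TokenStreamComponents(tokenizer, stream);
  }
}
{code}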
[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries
[ https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13664228#comment-13664228 ] Christian Moen commented on LUCENE-4956: Thanks a lot! > the korean analyzer that has a korean morphological analyzer and dictionaries > - > > Key: LUCENE-4956 > URL: https://issues.apache.org/jira/browse/LUCENE-4956 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/analysis >Affects Versions: 4.2 >Reporter: SooMyung Lee >Assignee: Christian Moen > Labels: newbie > Attachments: kr.analyzer.4x.tar, lucene4956.patch > > > Korean language has specific characteristic. When developing search service > with lucene & solr in korean, there are some problems in searching and > indexing. The korean analyer solved the problems with a korean morphological > anlyzer. It consists of a korean morphological analyzer, dictionaries, a > korean tokenizer and a korean filter. The korean anlyzer is made for lucene > and solr. If you develop a search service with lucene in korean, It is the > best idea to choose the korean analyzer. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries
[ https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13664223#comment-13664223 ] Christian Moen commented on LUCENE-4956: I'm happy to take care of this unless you want to do it, Steve. I can do this either tomorrow or on Friday. Thanks. > the korean analyzer that has a korean morphological analyzer and dictionaries > - > > Key: LUCENE-4956 > URL: https://issues.apache.org/jira/browse/LUCENE-4956 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/analysis >Affects Versions: 4.2 >Reporter: SooMyung Lee >Assignee: Christian Moen > Labels: newbie > Attachments: kr.analyzer.4x.tar, lucene4956.patch > > > Korean language has specific characteristic. When developing search service > with lucene & solr in korean, there are some problems in searching and > indexing. The korean analyer solved the problems with a korean morphological > anlyzer. It consists of a korean morphological analyzer, dictionaries, a > korean tokenizer and a korean filter. The korean anlyzer is made for lucene > and solr. If you develop a search service with lucene in korean, It is the > best idea to choose the korean analyzer. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries
[ https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13660528#comment-13660528 ] Christian Moen commented on LUCENE-4956: I've run {{KoreanAnalyzer}} on Korean Wikipedia and also had a look at memory/heap usage. Things look okay overall. I believe {{KoreanFilter}} uses wrong offsets for synonym tokens, which was discovered by random-blasting. Looking into the issue... > the korean analyzer that has a korean morphological analyzer and dictionaries > - > > Key: LUCENE-4956 > URL: https://issues.apache.org/jira/browse/LUCENE-4956 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/analysis >Affects Versions: 4.2 >Reporter: SooMyung Lee >Assignee: Christian Moen > Labels: newbie > Attachments: kr.analyzer.4x.tar > > > Korean language has specific characteristic. When developing search service > with lucene & solr in korean, there are some problems in searching and > indexing. The korean analyer solved the problems with a korean morphological > anlyzer. It consists of a korean morphological analyzer, dictionaries, a > korean tokenizer and a korean filter. The korean anlyzer is made for lucene > and solr. If you develop a search service with lucene in korean, It is the > best idea to choose the korean analyzer. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
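For readers unfamiliar with the random-blasting mentioned above, a sketch of such a test is shown below; the {{KoreanAnalyzer}} package and no-argument constructor are assumptions based on the attached code, not a verified API.

{code:java}
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.BaseTokenStreamTestCase;
import org.apache.lucene.analysis.kr.KoreanAnalyzer; // package/constructor assumed from the attached code

// Sketch of a random-blasting test: BaseTokenStreamTestCase feeds the analyzer random
// text and checks invariants such as offsets never going backwards, which is the kind
// of check that flags synonym-offset problems like the one described above.
public class TestKoreanAnalyzerRandom extends BaseTokenStreamTestCase {
  public void testRandomStrings() throws Exception {
    Analyzer analyzer = new KoreanAnalyzer(); // hypothetical no-arg constructor
    checkRandomData(random(), analyzer, 1000 * RANDOM_MULTIPLIER);
    analyzer.close();
  }
}
{code}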
[jira] [Commented] (SOLR-4813) Unavoidable IllegalArgumentException occurs when SynonymFilterFactory's setting has tokenizer factory's parameter.
[ https://issues.apache.org/jira/browse/SOLR-4813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13657937#comment-13657937 ] Christian Moen commented on SOLR-4813: -- Good work. Thanks! > Unavoidable IllegalArgumentException occurs when SynonymFilterFactory's > setting has tokenizer factory's parameter. > -- > > Key: SOLR-4813 > URL: https://issues.apache.org/jira/browse/SOLR-4813 > Project: Solr > Issue Type: Bug > Components: Schema and Analysis >Affects Versions: 4.3 >Reporter: Shingo Sasaki >Assignee: Hoss Man >Priority: Critical > Labels: SynonymFilterFactory > Fix For: 5.0, 4.4, 4.3.1 > > Attachments: SOLR-4813__4x.patch, SOLR-4813.patch, SOLR-4813.patch > > > When I write SynonymFilterFactory' setting in schema.xml as follows, ... > {code:xml} > >minGramSize="2"/> >ignoreCase="true" expand="true" >tokenizerFactory="solr.NGramTokenizerFactory" maxGramSize="2" > minGramSize="2"/> > > {code} > IllegalArgumentException ("Unknown parameters") occurs. > {noformat} > Caused by: java.lang.IllegalArgumentException: Unknown parameters: > {maxGramSize=2, minGramSize=2} > at > org.apache.lucene.analysis.synonym.FSTSynonymFilterFactory.(FSTSynonymFilterFactory.java:71) > at > org.apache.lucene.analysis.synonym.SynonymFilterFactory.(SynonymFilterFactory.java:50) > ... 28 more > {noformat} > However TokenizerFactory's params should be set to loadTokenizerFactory > method in [FST|Slow]SynonymFilterFactory. (ref. SOLR-2909) > I think, the problem was caused by LUCENE-4877 ("Fix analyzer factories to > throw exception when arguments are invalid") and SOLR-3402 ("Parse Version > outside of Analysis Factories"). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries
[ https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13656919#comment-13656919 ] Christian Moen commented on LUCENE-4956: Hello SooMyung, Thanks for the above regarding field type. The general approach we have taken in Lucene is to do the same analysis at both index and query side. For example, the Japanese analyzer also has functionality to do compound splitting and we've discussed doing this on the index side only by default for field type {{text_ja}}, but we decided against it. I've included your field type in the latest code I've checked in just now, but it's likely that we will change this in the future. I'm wondering if you could help me with a few sample sentences that illustrate the various options {{KoreanFilter}} has. I'd like to add some test-cases for these to better understand the differences between them and to verify correct behaviour. Test-cases for this are also a useful way to document functionality in general. Thanks for any help with this! > the korean analyzer that has a korean morphological analyzer and dictionaries > - > > Key: LUCENE-4956 > URL: https://issues.apache.org/jira/browse/LUCENE-4956 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/analysis >Affects Versions: 4.2 >Reporter: SooMyung Lee >Assignee: Christian Moen > Labels: newbie > Attachments: kr.analyzer.4x.tar > > > Korean language has specific characteristic. When developing search service > with lucene & solr in korean, there are some problems in searching and > indexing. The korean analyer solved the problems with a korean morphological > anlyzer. It consists of a korean morphological analyzer, dictionaries, a > korean tokenizer and a korean filter. The korean anlyzer is made for lucene > and solr. If you develop a search service with lucene in korean, It is the > best idea to choose the korean analyzer. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
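As an example of the kind of test asked for above, a sketch using {{BaseTokenStreamTestCase#assertAnalyzesTo}} follows. The input sentence, expected terms and offsets are placeholders to be replaced with real analyzer output, and the {{KoreanAnalyzer}} package and constructor are assumptions.

{code:java}
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.BaseTokenStreamTestCase;
import org.apache.lucene.analysis.kr.KoreanAnalyzer; // package/constructor assumed

// Sketch: one sample sentence per KoreanFilter option, each pinning down the expected
// terms and offsets. The values below are placeholders, not verified analyzer output.
public class TestKoreanFilterOptions extends BaseTokenStreamTestCase {
  public void testSampleSentence() throws Exception {
    Analyzer analyzer = new KoreanAnalyzer();
    assertAnalyzesTo(analyzer, "샘플 문장",
        new String[] {"샘플", "문장"},   // expected terms (placeholder)
        new int[] {0, 3},               // expected start offsets (placeholder)
        new int[] {2, 5});              // expected end offsets (placeholder)
    analyzer.close();
  }
}
{code}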
[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries
[ https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13656901#comment-13656901 ] Christian Moen commented on LUCENE-4956: Thanks, Steve & co.! > the korean analyzer that has a korean morphological analyzer and dictionaries > - > > Key: LUCENE-4956 > URL: https://issues.apache.org/jira/browse/LUCENE-4956 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/analysis >Affects Versions: 4.2 >Reporter: SooMyung Lee >Assignee: Christian Moen > Labels: newbie > Attachments: kr.analyzer.4x.tar > > > Korean language has specific characteristic. When developing search service > with lucene & solr in korean, there are some problems in searching and > indexing. The korean analyer solved the problems with a korean morphological > anlyzer. It consists of a korean morphological analyzer, dictionaries, a > korean tokenizer and a korean filter. The korean anlyzer is made for lucene > and solr. If you develop a search service with lucene in korean, It is the > best idea to choose the korean analyzer. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries
[ https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13651826#comment-13651826 ] Christian Moen commented on LUCENE-4956: Updates: * Added {{text_kr}} field type to {{schema.xml}} * Fixed Solr factories to load field type {{text_kr}} in the example * Updated javadoc so that it compiles cleanly (mostly removed illegal javadoc) * Updated various build configuration to include Korean in the Solr distribution * Added placeholder stopwords file * Added services for arirang Korean analysis using field type {{text_kr}} seems to be doing the right thing out-of-the-box now, but some configuration options in the factories aren't working yet. There are several other things that need polishing up, but we're making progress. > the korean analyzer that has a korean morphological analyzer and dictionaries > - > > Key: LUCENE-4956 > URL: https://issues.apache.org/jira/browse/LUCENE-4956 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/analysis >Affects Versions: 4.2 >Reporter: SooMyung Lee >Assignee: Christian Moen > Labels: newbie > Attachments: kr.analyzer.4x.tar > > > Korean language has specific characteristic. When developing search service > with lucene & solr in korean, there are some problems in searching and > indexing. The korean analyer solved the problems with a korean morphological > anlyzer. It consists of a korean morphological analyzer, dictionaries, a > korean tokenizer and a korean filter. The korean anlyzer is made for lucene > and solr. If you develop a search service with lucene in korean, It is the > best idea to choose the korean analyzer. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries
[ https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13649833#comment-13649833 ] Christian Moen commented on LUCENE-4956: bq. I think we're ready for the incubator-general vote. [~cm], do you agree? +1 > the korean analyzer that has a korean morphological analyzer and dictionaries > - > > Key: LUCENE-4956 > URL: https://issues.apache.org/jira/browse/LUCENE-4956 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/analysis >Affects Versions: 4.2 >Reporter: SooMyung Lee >Assignee: Christian Moen > Labels: newbie > Attachments: kr.analyzer.4x.tar > > > Korean language has specific characteristic. When developing search service > with lucene & solr in korean, there are some problems in searching and > indexing. The korean analyer solved the problems with a korean morphological > anlyzer. It consists of a korean morphological analyzer, dictionaries, a > korean tokenizer and a korean filter. The korean anlyzer is made for lucene > and solr. If you develop a search service with lucene in korean, It is the > best idea to choose the korean analyzer. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries
[ https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13649379#comment-13649379 ] Christian Moen commented on LUCENE-4956: Good points, Uwe. I'll look into this. > the korean analyzer that has a korean morphological analyzer and dictionaries > - > > Key: LUCENE-4956 > URL: https://issues.apache.org/jira/browse/LUCENE-4956 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/analysis >Affects Versions: 4.2 >Reporter: SooMyung Lee >Assignee: Christian Moen > Labels: newbie > Attachments: kr.analyzer.4x.tar > > > Korean language has specific characteristic. When developing search service > with lucene & solr in korean, there are some problems in searching and > indexing. The korean analyer solved the problems with a korean morphological > anlyzer. It consists of a korean morphological analyzer, dictionaries, a > korean tokenizer and a korean filter. The korean anlyzer is made for lucene > and solr. If you develop a search service with lucene in korean, It is the > best idea to choose the korean analyzer. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries
[ https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13649296#comment-13649296 ] Christian Moen commented on LUCENE-4956: Thanks, Steve. I've added the missing license header to {{TestKoreanAnalyzer.java}}. > the korean analyzer that has a korean morphological analyzer and dictionaries > - > > Key: LUCENE-4956 > URL: https://issues.apache.org/jira/browse/LUCENE-4956 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/analysis >Affects Versions: 4.2 >Reporter: SooMyung Lee >Assignee: Christian Moen > Labels: newbie > Attachments: kr.analyzer.4x.tar > > > Korean language has specific characteristic. When developing search service > with lucene & solr in korean, there are some problems in searching and > indexing. The korean analyer solved the problems with a korean morphological > anlyzer. It consists of a korean morphological analyzer, dictionaries, a > korean tokenizer and a korean filter. The korean anlyzer is made for lucene > and solr. If you develop a search service with lucene in korean, It is the > best idea to choose the korean analyzer. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries
[ https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13649260#comment-13649260 ] Christian Moen commented on LUCENE-4956: I've created branch {{lucene4956}} and checked in an {{arirang}} module in {{lucene/analysis}}. I've added a basic test that tests segmentation, offsets, etc. Other updates: * Some compilation warnings related to generics have been fixed, but several remain. * License headers have been added to all source code files * Author tags have been removed from all files, except {{StringUtils}} pending SooMyoung's feedback (see above) * Added IntelliJ IDEA config to make {{ant idea}} set things up correctly. Eclipse is TODO. My next step is to fix the remaining compilation warnings, and once we've confirmed {{StringUtils}}, I think we can do the incubator-general vote. I'll keep you posted. I think we should also consider rewriting and optimising some of the code here and there, but that's for later. It's great if you can be involved in this process, SooMyoung! I'll probably need your help and good advice here and there. :) > the korean analyzer that has a korean morphological analyzer and dictionaries > - > > Key: LUCENE-4956 > URL: https://issues.apache.org/jira/browse/LUCENE-4956 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/analysis >Affects Versions: 4.2 >Reporter: SooMyung Lee >Assignee: Christian Moen > Labels: newbie > Attachments: kr.analyzer.4x.tar > > > Korean language has specific characteristic. When developing search service > with lucene & solr in korean, there are some problems in searching and > indexing. The korean analyer solved the problems with a korean morphological > anlyzer. It consists of a korean morphological analyzer, dictionaries, a > korean tokenizer and a korean filter. The korean anlyzer is made for lucene > and solr. If you develop a search service with lucene in korean, It is the > best idea to choose the korean analyzer. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries
[ https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13649258#comment-13649258 ] Christian Moen commented on LUCENE-4956: Hello SooMyoung, Could you comment about the origins and authorship of {{org.apache.lucene.analysis.kr.utils.StringUtil}} in your tar file? I'm seeing a lot of authors in this file. Is this from Apache Commons Lang? Thanks! > the korean analyzer that has a korean morphological analyzer and dictionaries > - > > Key: LUCENE-4956 > URL: https://issues.apache.org/jira/browse/LUCENE-4956 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/analysis >Affects Versions: 4.2 >Reporter: SooMyung Lee >Assignee: Christian Moen > Labels: newbie > Attachments: kr.analyzer.4x.tar > > > Korean language has specific characteristic. When developing search service > with lucene & solr in korean, there are some problems in searching and > indexing. The korean analyer solved the problems with a korean morphological > anlyzer. It consists of a korean morphological analyzer, dictionaries, a > korean tokenizer and a korean filter. The korean anlyzer is made for lucene > and solr. If you develop a search service with lucene in korean, It is the > best idea to choose the korean analyzer. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Assigned] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries
[ https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christian Moen reassigned LUCENE-4956: -- Assignee: Christian Moen > the korean analyzer that has a korean morphological analyzer and dictionaries > - > > Key: LUCENE-4956 > URL: https://issues.apache.org/jira/browse/LUCENE-4956 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/analysis >Affects Versions: 4.2 >Reporter: SooMyung Lee >Assignee: Christian Moen > Labels: newbie > Attachments: kr.analyzer.4x.tar > > > Korean language has specific characteristic. When developing search service > with lucene & solr in korean, there are some problems in searching and > indexing. The korean analyer solved the problems with a korean morphological > anlyzer. It consists of a korean morphological analyzer, dictionaries, a > korean tokenizer and a korean filter. The korean anlyzer is made for lucene > and solr. If you develop a search service with lucene in korean, It is the > best idea to choose the korean analyzer. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries
[ https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13649125#comment-13649125 ] Christian Moen commented on LUCENE-4956: A quick status update on my side is as follows: I've put the code into a module called {{arirang}} on my local setup and made a few changes necessary to make things work on {{trunk}}. {{KoreanAnalyzer}} now produces Korean tokens and some tests I've made pass when run from my IDE. Loading the dictionaries as resources needs some work and I'll spend time on this during the weekend. I'll also address the headers, etc. to prepare for the incubator-general vote. Hopefully, I'll have all this on a branch this weekend. I'll keep you posted and we can take things from there. > the korean analyzer that has a korean morphological analyzer and dictionaries > - > > Key: LUCENE-4956 > URL: https://issues.apache.org/jira/browse/LUCENE-4956 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/analysis >Affects Versions: 4.2 >Reporter: SooMyung Lee > Labels: newbie > Attachments: kr.analyzer.4x.tar > > > Korean language has specific characteristic. When developing search service > with lucene & solr in korean, there are some problems in searching and > indexing. The korean analyer solved the problems with a korean morphological > anlyzer. It consists of a korean morphological analyzer, dictionaries, a > korean tokenizer and a korean filter. The korean anlyzer is made for lucene > and solr. If you develop a search service with lucene in korean, It is the > best idea to choose the korean analyzer. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries
[ https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13646960#comment-13646960 ] Christian Moen commented on LUCENE-4956: SooMyung, I don't think you need to do anything at this point. I think a good next step is that we create a new branch and check in the code you have submitted on that branch. We can then start looking into addressing the headers and other items that people have pointed out in comments. (Thanks, Jack and Edward!) Steve, will there be a vote after the code has been checked in on the branch? If you think the above is a good next step, I'm happy to start working on this either later this week or next week. Kindly let me know how you prefer to proceed. Thanks. > the korean analyzer that has a korean morphological analyzer and dictionaries > - > > Key: LUCENE-4956 > URL: https://issues.apache.org/jira/browse/LUCENE-4956 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/analysis >Affects Versions: 4.2 >Reporter: SooMyung Lee > Labels: newbie > Attachments: kr.analyzer.4x.tar > > > Korean language has specific characteristic. When developing search service > with lucene & solr in korean, there are some problems in searching and > indexing. The korean analyer solved the problems with a korean morphological > anlyzer. It consists of a korean morphological analyzer, dictionaries, a > korean tokenizer and a korean filter. The korean anlyzer is made for lucene > and solr. If you develop a search service with lucene in korean, It is the > best idea to choose the korean analyzer. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries
[ https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13643901#comment-13643901 ] Christian Moen commented on LUCENE-4956: The Korean analyzer should be named {{org.apache.lucene.analysis.kr.KoreanAnalyzer}} and we'll provide a ready-to-use field type {{text_kr}} in {{schema.xml}} for Solr users, which is consistent with what we do for other languages. As for where the analyzer code itself lives, I think it's fine to put it in {{lucene/analysis/arirang}}. The file {{lucene/analysis/README.txt}} documents what these modules are, and the code is easily and directly retrievable in IDEs by looking up {{KoreanAnalyzer}} (the source code paths will be set up by {{ant eclipse}} and {{ant idea}}). One reason analyzers have not been put in {{lucene/analysis/common}} in the past is that they require dictionaries that are several megabytes in size. Overall, I don't think the scheme we are using is all that problematic, but it's true that {{MorfologikAnalyzer}} and {{SmartChineseAnalyzer}} don't align with it. The scheme doesn't easily lend itself to different implementations for one language, but that's not a common case today, although it might become more common in the future. In the case of Norwegian (no), there are ISO language codes for both Bokmål (nb) and Nynorsk (nn), and one way of supporting this is also to consider these as options to {{NorwegianAnalyzer}} since both languages are Norwegian. See SOLR-4565 for thoughts on how to extend support in {{NorwegianMinimalStemFilter}} for this. A similar overall approach might make sense when there are multiple implementations for a language; end-users can use an analyzer named {{Analyzer}} without having to study the differences in implementation before using it. I also see problems with this, but it's just a thought... I'm all for improving our scheme, but perhaps we can open up a separate JIRA for this and keep this one focused on Korean? > the korean analyzer that has a korean morphological analyzer and dictionaries > - > > Key: LUCENE-4956 > URL: https://issues.apache.org/jira/browse/LUCENE-4956 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/analysis >Affects Versions: 4.2 >Reporter: SooMyung Lee > Labels: newbie > Attachments: kr.analyzer.4x.tar > > > Korean language has specific characteristic. When developing search service > with lucene & solr in korean, there are some problems in searching and > indexing. The korean analyer solved the problems with a korean morphological > anlyzer. It consists of a korean morphological analyzer, dictionaries, a > korean tokenizer and a korean filter. The korean anlyzer is made for lucene > and solr. If you develop a search service with lucene in korean, It is the > best idea to choose the korean analyzer. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries
[ https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13641365#comment-13641365 ] Christian Moen commented on LUCENE-4956: Thanks again, SooMyung! I'm seeing that Steven has informed you about the grant process on the mailing list. I'm happy to also facilitate this process with Steven. Looking forward to getting Korean supported. > the korean analyzer that has a korean morphological analyzer and dictionaries > - > > Key: LUCENE-4956 > URL: https://issues.apache.org/jira/browse/LUCENE-4956 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/analysis >Affects Versions: 4.2 >Reporter: SooMyung Lee > Labels: newbie > Attachments: kr.analyzer.4x.tar > > > Korean language has specific characteristic. When developing search service > with lucene & solr in korean, there are some problems in searching and > indexing. The korean analyer solved the problems with a korean morphological > anlyzer. It consists of a korean morphological analyzer, dictionaries, a > korean tokenizer and a korean filter. The korean anlyzer is made for lucene > and solr. If you develop a search service with lucene in korean, It is the > best idea to choose the korean analyzer. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4947) Java implementation (and improvement) of Levenshtein & associated lexicon automata
[ https://issues.apache.org/jira/browse/LUCENE-4947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13640551#comment-13640551 ] Christian Moen commented on LUCENE-4947: Kevin, I think it's best that you do the license change yourself and that we don't have any active role in making the change since you are the only person entitled to make the change. This change can be done by using the below header on all the source code and other relevant text files: {noformat} /* * Licensed to the Apache Software Foundation (ASF) under one or more * contributor license agreements. See the NOTICE file distributed with * this work for additional information regarding copyright ownership. * The ASF licenses this file to You under the Apache License, Version 2.0 * (the "License"); you may not use this file except in compliance with * the License. You may obtain a copy of the License at * * http://www.apache.org/licenses/LICENSE-2.0 * * Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. */ {noformat} After this has been done, please make a tarball and attach it to this JIRA and indicate that this is the code you wish to grant and also inform us about the MD5 hash of the tarball. (This will go into the IP-clearance document and will be used to identify the codebase.) It's a good idea to also use this MD5 hash as part of Exhibit A in the [software-grant.txt|http://www.apache.org/licenses/software-grant.txt] agreement unless you have signed and submitted this already. (If you donate the code yourself by attaching it to the JIRA as described above, I believe the hashes not being part of Exhibit A is acceptable.) Please feel free to add your comments, Steve. > Java implementation (and improvement) of Levenshtein & associated lexicon > automata > -- > > Key: LUCENE-4947 > URL: https://issues.apache.org/jira/browse/LUCENE-4947 > Project: Lucene - Core > Issue Type: Improvement >Affects Versions: 4.0-ALPHA, 4.0-BETA, 4.0, 4.1, 4.2, 4.2.1 >Reporter: Kevin Lawson > > I was encouraged by Mike McCandless to open an issue concerning this after I > contacted him privately about it. Thanks Mike! > I'd like to submit my Java implementation of the Levenshtein Automaton as a > homogenous replacement for the current heterogenous, multi-component > implementation in Lucene. > Benefits of upgrading include > - Reduced code complexity > - Better performance from components that were previously implemented in > Python > - Support for on-the-fly dictionary-automaton manipulation (if you wish to > use my dictionary-automaton implementation) > The code for all the components is well structured, easy to follow, and > extensively commented. It has also been fully tested for correct > functionality and performance. > The levenshtein automaton implementation (along with the required MDAG > reference) can be found in my LevenshteinAutomaton Java library here: > https://github.com/klawson88/LevenshteinAutomaton. > The minimalistic directed acyclic graph (MDAG) which the automaton code uses > to store and step through word sets can be found here: > https://github.com/klawson88/MDAG > *Transpositions aren't currently implemented. 
I hope the comment filled, > editing-friendly code combined with the fact that the section in the Mihov > paper detailing transpositions is only 2 pages makes adding the functionality > trivial. > *As a result of support for on-the-fly manipulation, the MDAG > (dictionary-automaton) creation process incurs a slight speed penalty. In > order to have the best of both worlds, i'd recommend the addition of a > constructor which only takes sorted input. The complete, easy to follow > pseudo-code for the simple procedure can be found in the first article I > linked under the references section in the MDAG repository) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4947) Java implementation (and improvement) of Levenshtein & associated lexicon automata
[ https://issues.apache.org/jira/browse/LUCENE-4947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13638128#comment-13638128 ] Christian Moen commented on LUCENE-4947: It sounds proper to do a code grant also because the software currently has a GPL license. Thanks for following up, Steve. > Java implementation (and improvement) of Levenshtein & associated lexicon > automata > -- > > Key: LUCENE-4947 > URL: https://issues.apache.org/jira/browse/LUCENE-4947 > Project: Lucene - Core > Issue Type: Improvement >Affects Versions: 4.0-ALPHA, 4.0-BETA, 4.0, 4.1, 4.2, 4.2.1 >Reporter: Kevin Lawson > > I was encouraged by Mike McCandless to open an issue concerning this after I > contacted him privately about it. Thanks Mike! > I'd like to submit my Java implementation of the Levenshtein Automaton as a > homogenous replacement for the current heterogenous, multi-component > implementation in Lucene. > Benefits of upgrading include > - Reduced code complexity > - Better performance from components that were previously implemented in > Python > - Support for on-the-fly dictionary-automaton manipulation (if you wish to > use my dictionary-automaton implementation) > The code for all the components is well structured, easy to follow, and > extensively commented. It has also been fully tested for correct > functionality and performance. > The levenshtein automaton implementation (along with the required MDAG > reference) can be found in my LevenshteinAutomaton Java library here: > https://github.com/klawson88/LevenshteinAutomaton. > The minimalistic directed acyclic graph (MDAG) which the automaton code uses > to store and step through word sets can be found here: > https://github.com/klawson88/MDAG > *Transpositions aren't currently implemented. I hope the comment filled, > editing-friendly code combined with the fact that the section in the Mihov > paper detailing transpositions is only 2 pages makes adding the functionality > trivial. > *As a result of support for on-the-fly manipulation, the MDAG > (dictionary-automaton) creation process incurs a slight speed penalty. In > order to have the best of both worlds, i'd recommend the addition of a > constructor which only takes sorted input. The complete, easy to follow > pseudo-code for the simple procedure can be found in the first article I > linked under the references section in the MDAG repository) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4947) Java implementation (and improvement) of Levenshtein & associated lexicon automata
[ https://issues.apache.org/jira/browse/LUCENE-4947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13637943#comment-13637943 ] Christian Moen commented on LUCENE-4947: Thanks a lot for wishing to submit code! It's not possible to include your code in Lucene if it has a GPL license. Quite frankly, I don't think Lucene committers can even have a look at it to consider it for inclusion while it has a GPL license. If you have written all the code or otherwise own all copyrights, would you mind switching to Apache License 2.0? That way, I think it would at least be possible to have a close look to see if this is a good fit for Lucene. > Java implementation (and improvement) of Levenshtein & associated lexicon > automata > -- > > Key: LUCENE-4947 > URL: https://issues.apache.org/jira/browse/LUCENE-4947 > Project: Lucene - Core > Issue Type: Improvement >Affects Versions: 4.0-ALPHA, 4.0-BETA, 4.0, 4.1, 4.2, 4.2.1 >Reporter: Kevin Lawson > > I was encouraged by Mike McCandless to open an issue concerning this after I > contacted him privately about it. Thanks Mike! > I'd like to submit my Java implementation of the Levenshtein Automaton as a > homogenous replacement for the current heterogenous, multi-component > implementation in Lucene. > Benefits of upgrading include > - Reduced code complexity > - Better performance from components that were previously implemented in > Python > - Support for on-the-fly dictionary-automaton manipulation (if you wish to > use my dictionary-automaton implementation) > The code for all the components is well structured, easy to follow, and > extensively commented. It has also been fully tested for correct > functionality and performance. > The levenshtein automaton implementation (along with the required MDAG > reference) can be found in my LevenshteinAutomaton Java library here: > https://github.com/klawson88/LevenshteinAutomaton. > The minimalistic directed acyclic graph (MDAG) which the automaton code uses > to store and step through word sets can be found here: > https://github.com/klawson88/MDAG > *Transpositions aren't currently implemented. I hope the comment filled, > editing-friendly code combined with the fact that the section in the Mihov > paper detailing transpositions is only 2 pages makes adding the functionality > trivial. > *As a result of support for on-the-fly manipulation, the MDAG > (dictionary-automaton) creation process incurs a slight speed penalty. In > order to have the best of both worlds, i'd recommend the addition of a > constructor which only takes sorted input. The complete, easy to follow > pseudo-code for the simple procedure can be found in the first article I > linked under the references section in the MDAG repository) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-3706) Ship setup to log with log4j.
[ https://issues.apache.org/jira/browse/SOLR-3706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13603516#comment-13603516 ] Christian Moen commented on SOLR-3706: -- {quote} Mark, have you tried Logback? That's a good logging implementation; arguably a better one. {quote} David and Mark, I believe [Log4J 2|http://logging.apache.org/log4j/2.x/] addresses a lot of the weaknesses in Log4J 1.x also addressed by Logback. However, Log4J 2 hasn't been released yet. To me it sounds like a good idea to use Log4J 1.x now and move to Log4J 2 in the future. > Ship setup to log with log4j. > - > > Key: SOLR-3706 > URL: https://issues.apache.org/jira/browse/SOLR-3706 > Project: Solr > Issue Type: Improvement >Reporter: Mark Miller >Assignee: Mark Miller >Priority: Minor > Fix For: 4.3, 5.0 > > Attachments: SOLR-3706-solr-log4j.patch > > > Currently we default to java util logging and it's terrible in my opinion. > *It's simple built in logger is a 2 line logger. > *You have to jump through hoops to use your own custom formatter with jetty - > either putting your class in the start.jar or other pain in the butt > solutions. > *It can't roll files by date out of the box. > I'm sure there are more issues, but those are the ones annoying me now. We > should switch to log4j - it's much nicer and it's easy to get a nice single > line format and roll by date, etc. > If someone wants to use JUL they still can - but at least users could start > with something decent. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-4407) SSL auth or basic auth in SolrCloud
[ https://issues.apache.org/jira/browse/SOLR-4407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13572524#comment-13572524 ] Christian Moen commented on SOLR-4407: -- Thanks a lot for clarifying, Jan. I wasn't aware of this limitation. > SSL auth or basic auth in SolrCloud > --- > > Key: SOLR-4407 > URL: https://issues.apache.org/jira/browse/SOLR-4407 > Project: Solr > Issue Type: New Feature > Components: SolrCloud >Affects Versions: 4.1 >Reporter: Sindre Fiskaa > Labels: Authentication, Certificate, SSL > Fix For: 4.2, 5.0 > > > I need to be able to secure sensitive information in solrnodes running in a > SolrCloud with either SSL client/server certificates or http basic auth.. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-4407) SSL auth or basic auth in SolrCloud
[ https://issues.apache.org/jira/browse/SOLR-4407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13572507#comment-13572507 ] Christian Moen commented on SOLR-4407: -- I don't think this is a Solr issue, but it might be helpful to provide general information on how to secure Solr's interfaces. However, how to set this up is Servlet container specific. Could you clarify what you had in mind for this? Thanks. > SSL auth or basic auth in SolrCloud > --- > > Key: SOLR-4407 > URL: https://issues.apache.org/jira/browse/SOLR-4407 > Project: Solr > Issue Type: New Feature > Components: SolrCloud >Affects Versions: 4.1 >Reporter: Sindre Fiskaa > Labels: Authentication, Certificate, SSL > > I need to be able to secure sensitive information in solrnodes running in a > SolrCloud with either SSL client/server certificates or http basic auth.. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji
[ https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13474287#comment-13474287 ] Christian Moen commented on LUCENE-3922: Ohtani-san, I saw your tweet about this earlier and it sounds like a very good idea. Thanks. I will try to set aside some time to work on this. > Add Japanese Kanji number normalization to Kuromoji > --- > > Key: LUCENE-3922 > URL: https://issues.apache.org/jira/browse/LUCENE-3922 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/analysis >Affects Versions: 4.0-ALPHA >Reporter: Kazuaki Hiraga > Labels: features > Attachments: LUCENE-3922.patch > > > Japanese people use Kanji numerals instead of Arabic numerals for writing > price, address and so on. i.e 12万4800円(124,800JPY), 二番町三ノ二(3-2 Nibancho) and > 十二月(December). So, we would like to normalize those Kanji numerals to Arabic > numerals (I don't think we need to have a capability to normalize to Kanji > numerals). > -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji
[ https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13474224#comment-13474224 ] Christian Moen commented on LUCENE-3922: Thanks, Kazu. I'm aware of the issue and the thinking is to rework this as a {{TokenFilter}} and use anchoring options with surrounding tokens to decide if normalisation should take place, i.e. if the preceding token is ¥ or the following token is 円 in the case of normalising prices. It might also be helpful to look into using POS-info for this to benefit from what we actually know about the token, i.e. to not apply normalisation if the POS tag is a person name. Other suggestions and ideas are of course most welcome. > Add Japanese Kanji number normalization to Kuromoji > --- > > Key: LUCENE-3922 > URL: https://issues.apache.org/jira/browse/LUCENE-3922 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/analysis >Affects Versions: 4.0-ALPHA >Reporter: Kazuaki Hiraga > Labels: features > Attachments: LUCENE-3922.patch > > > Japanese people use Kanji numerals instead of Arabic numerals for writing > price, address and so on. i.e 12万4800円(124,800JPY), 二番町三ノ二(3-2 Nibancho) and > 十二月(December). So, we would like to normalize those Kanji numerals to Arabic > numerals (I don't think we need to have a capability to normalize to Kanji > numerals). > -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
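To illustrate the anchoring idea in the comment above, here is a minimal sketch (not the eventual Kuromoji implementation) of a {{TokenFilter}} that only rewrites a token when the preceding token was "¥". It assumes the tokenizer is configured to keep punctuation so that "¥" survives as a token; the 円-suffix case would additionally need one token of lookahead, and a POS check could be added the same way via Kuromoji's {{PartOfSpeechAttribute}}. The {{normalizeKanjiNumber}} helper is hypothetical and merely stands in for the actual conversion logic.
{code:java}
import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

/**
 * Sketch: rewrite a token only when it is anchored by a preceding "¥" token.
 * Assumes the tokenizer keeps punctuation so that "¥" survives as a token.
 * normalizeKanjiNumber() is a hypothetical stand-in for the real conversion.
 */
public final class AnchoredNumberNormalizationFilter extends TokenFilter {

  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private String previousTerm;

  public AnchoredNumberNormalizationFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    String term = termAtt.toString();
    if ("¥".equals(previousTerm)) {
      // Anchored by a yen sign: replace the term with its normalized form.
      termAtt.setEmpty().append(normalizeKanjiNumber(term));
    }
    previousTerm = term;
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    previousTerm = null;
  }

  // Hypothetical helper; the actual kanji-to-Arabic conversion is elided.
  private static String normalizeKanjiNumber(String term) {
    return term;
  }
}
{code}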
[jira] [Commented] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji
[ https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13471132#comment-13471132 ] Christian Moen commented on LUCENE-3922: {quote} Is it difficult to support numbers with period as the following? 3.2兆円 5.2億円 {quote} Supporting this is no problem and a good idea. {quote} I think It would be helpful that this charfilter supports old Kanji numeric characters ("KYU-KANJI" or "DAIJI") such as 壱, 壹 (One), 弌, 弐, 貳 (Two), 弍, 参,參 (Three), or configureable. {quote} This is also easy to support. As for making preserving zeros configurable, that's also possible, of course. It's great to get more feedback on what sort of functionality we need and what should be configurable options. Hopefully, we can find a good balance without adding too much complexity. Thanks for the feedback. > Add Japanese Kanji number normalization to Kuromoji > --- > > Key: LUCENE-3922 > URL: https://issues.apache.org/jira/browse/LUCENE-3922 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/analysis >Affects Versions: 4.0-ALPHA >Reporter: Kazuaki Hiraga > Labels: features > Attachments: LUCENE-3922.patch > > > Japanese people use Kanji numerals instead of Arabic numerals for writing > price, address and so on. i.e 12万4800円(124,800JPY), 二番町三ノ二(3-2 Nibancho) and > 十二月(December). So, we would like to normalize those Kanji numerals to Arabic > numerals (I don't think we need to have a capability to normalize to Kanji > numerals). > -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3921) Add decompose compound Japanese Katakana token capability to Kuromoji
[ https://issues.apache.org/jira/browse/LUCENE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13470967#comment-13470967 ] Christian Moen commented on LUCENE-3921: Lance, The idea I had in mind for Japanese uses language-specific characteristics for katakana terms and perhaps weights that are dictionary-specific as well. However, we are hacking our statistical model here and there are limitations as to how far we can go with this. I don't know a whole lot about the Smart Chinese toolkit, but I believe the same approach to compound segmentation could work for Chinese as well. However, weights and implementation would likely be separate. Note that the above is really about one specific kind of compound segmentation that applies to Japanese, so the thinking was to add additional heuristics for this specific type, which is particularly tricky. It might be a good idea to approach this problem also using the {{DictionaryCompoundWordTokenFilter}} and collectively build some lexical assets for compound splitting for the relevant languages rather than hacking our models. > Add decompose compound Japanese Katakana token capability to Kuromoji > - > > Key: LUCENE-3921 > URL: https://issues.apache.org/jira/browse/LUCENE-3921 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/analysis >Affects Versions: 4.0-ALPHA > Environment: Cent OS 5, IPA Dictionary, Run with "Search mdoe" >Reporter: Kazuaki Hiraga > Labels: features > > Japanese morphological analyzer, Kuromoji doesn't have a capability to > decompose every Japanese Katakana compound tokens to sub-tokens. It seems > that some Katakana tokens can be decomposed, but it cannot be applied every > Katakana compound tokens. For instance, "トートバッグ(tote bag)" and "ショルダーバッグ" > don't decompose into "トート バッグ" and "ショルダー バッグ" although the IPA dictionary > has "バッグ" in its entry. I would like to apply the decompose feature to every > Katakana tokens if the sub-tokens are in the dictionary or add the capability > to force apply the decompose feature to every Katakana tokens. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
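To make the {{DictionaryCompoundWordTokenFilter}} route mentioned above concrete, below is a rough sketch that chains it after {{JapaneseTokenizer}} with a tiny hand-made subword set, so that a token such as トートバッグ would additionally yield バッグ. The class name, the word list and the min/max subword lengths are illustrative assumptions rather than a recommended setup; a real lexical asset and tuned length settings for short katakana subwords would be needed.
{code:java}
import java.io.Reader;
import java.util.Arrays;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.compound.DictionaryCompoundWordTokenFilter;
import org.apache.lucene.analysis.ja.JapaneseTokenizer;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.util.Version;

/**
 * Sketch: decompose katakana compounds against a user-supplied subword list,
 * so that トートバッグ also yields バッグ. The subword set below is a stand-in
 * for a real lexical asset.
 */
public class KatakanaCompoundAnalyzer extends Analyzer {

  private final CharArraySet subwords = new CharArraySet(
      Version.LUCENE_40, Arrays.asList("バッグ", "ショルダー", "トート"), false);

  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    Tokenizer tokenizer = new JapaneseTokenizer(reader, null, true,
        JapaneseTokenizer.Mode.SEARCH);
    // Small min/max subword lengths so that short katakana subwords can match.
    TokenStream stream = new DictionaryCompoundWordTokenFilter(
        Version.LUCENE_40, tokenizer, subwords, 3, 2, 10, false);
    return new TokenStreamComponents(tokenizer, stream);
  }
}
{code}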
[jira] [Commented] (LUCENE-4433) kuromoji ToStringUtil.getRomanization
[ https://issues.apache.org/jira/browse/LUCENE-4433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13463637#comment-13463637 ] Christian Moen commented on LUCENE-4433: Any thoughts on whether we should backport this - or just a fix for the specific case mentioned - to the 3.6 branch, Robert? I'm happy to do it, but I'm not sure if there will be a 3.6.2 with 4.0 being so close. > kuromoji ToStringUtil.getRomanization > --- > > Key: LUCENE-4433 > URL: https://issues.apache.org/jira/browse/LUCENE-4433 > Project: Lucene - Core > Issue Type: Bug > Components: modules/analysis >Affects Versions: 3.6.1 >Reporter: Wang Han > > case 'メ': > builder.append("mi"); > break; > - > should be > case 'メ': > builder.append("me"); > break; > you can refer http://en.wikipedia.org/wiki/Katakana -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-4433) kuromoji ToStringUtil.getRomanization
[ https://issues.apache.org/jira/browse/LUCENE-4433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13463611#comment-13463611 ] Christian Moen edited comment on LUCENE-4433 at 9/26/12 7:19 PM: - Robert has already fixed this on {{trunk}} in {{r1339753}}. was (Author: cm): Robert has already fixed this on {{trunk}} in {{r1339753}. > kuromoji ToStringUtil.getRomanization > --- > > Key: LUCENE-4433 > URL: https://issues.apache.org/jira/browse/LUCENE-4433 > Project: Lucene - Core > Issue Type: Bug > Components: modules/analysis >Affects Versions: 3.6.1 >Reporter: Wang Han > > case 'メ': > builder.append("mi"); > break; > - > should be > case 'メ': > builder.append("me"); > break; > you can refer http://en.wikipedia.org/wiki/Katakana -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-4433) kuromoji ToStringUtil.getRomanization
[ https://issues.apache.org/jira/browse/LUCENE-4433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christian Moen updated LUCENE-4433: --- Component/s: modules/analysis Affects Version/s: 3.6.1 > kuromoji ToStringUtil.getRomanization > --- > > Key: LUCENE-4433 > URL: https://issues.apache.org/jira/browse/LUCENE-4433 > Project: Lucene - Core > Issue Type: Bug > Components: modules/analysis >Affects Versions: 3.6.1 >Reporter: Wang Han > > case 'メ': > builder.append("mi"); > break; > - > should be > case 'メ': > builder.append("me"); > break; > you can refer http://en.wikipedia.org/wiki/Katakana -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4433) kuromoji ToStringUtil.getRomanization
[ https://issues.apache.org/jira/browse/LUCENE-4433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13463611#comment-13463611 ] Christian Moen commented on LUCENE-4433: Robert has already fixed this on {{trunk}} in {{r1339753}. > kuromoji ToStringUtil.getRomanization > --- > > Key: LUCENE-4433 > URL: https://issues.apache.org/jira/browse/LUCENE-4433 > Project: Lucene - Core > Issue Type: Bug >Reporter: Wang Han > > case 'メ': > builder.append("mi"); > break; > - > should be > case 'メ': > builder.append("me"); > break; > you can refer http://en.wikipedia.org/wiki/Katakana -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4433) kuromoji ToStringUtil.getRomanization
[ https://issues.apache.org/jira/browse/LUCENE-4433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13463595#comment-13463595 ] Christian Moen commented on LUCENE-4433: Thanks a lot for this. I'll fix. > kuromoji ToStringUtil.getRomanization > --- > > Key: LUCENE-4433 > URL: https://issues.apache.org/jira/browse/LUCENE-4433 > Project: Lucene - Core > Issue Type: Bug >Reporter: Wang Han > > case 'メ': > builder.append("mi"); > break; > - > should be > case 'メ': > builder.append("me"); > break; > you can refer http://en.wikipedia.org/wiki/Katakana -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-3876) Solr Admin UI is completely dysfunctional on IE 9
[ https://issues.apache.org/jira/browse/SOLR-3876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christian Moen updated SOLR-3876: - Fix Version/s: (was: 4.0) 4.1 > Solr Admin UI is completely dysfunctional on IE 9 > - > > Key: SOLR-3876 > URL: https://issues.apache.org/jira/browse/SOLR-3876 > Project: Solr > Issue Type: Bug > Components: web gui >Affects Versions: 4.0-BETA, 4.0 > Environment: Windows 7, IE 9 >Reporter: Jack Krupansky >Priority: Critical > Fix For: 4.1 > > Attachments: screenshot-1.jpg, screenshot-2.jpg, screenshot-3.jpg > > > The Solr Admin UI is completely dysfunctional on IE 9. See attached screen > shot. I don't even see a "collection1" button. But Admin UI is working fine > in Google Chrome with same running instance of Solr. > Currently running 4.0 RC0, but problem existed with 4.0-BETA. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-3876) Solr Admin UI is completely dysfunctional on IE 9
[ https://issues.apache.org/jira/browse/SOLR-3876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13461889#comment-13461889 ] Christian Moen commented on SOLR-3876: -- The 4.0 UI wasn't developed with IE9 in mind so getting IE9 supported seems like a bigger effort. SOLR-3841 seems related to this issue and has been deferred to 4.1 so I'm suggesting that we do the same with this one as well. Please feel free to jump in with whatever comments you might have, steffkes. > Solr Admin UI is completely dysfunctional on IE 9 > - > > Key: SOLR-3876 > URL: https://issues.apache.org/jira/browse/SOLR-3876 > Project: Solr > Issue Type: Bug > Components: web gui >Affects Versions: 4.0-BETA, 4.0 > Environment: Windows 7, IE 9 >Reporter: Jack Krupansky >Priority: Critical > Fix For: 4.1 > > Attachments: screenshot-1.jpg, screenshot-2.jpg, screenshot-3.jpg > > > The Solr Admin UI is completely dysfunctional on IE 9. See attached screen > shot. I don't even see a "collection1" button. But Admin UI is working fine > in Google Chrome with same running instance of Solr. > Currently running 4.0 RC0, but problem existed with 4.0-BETA. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Comment Edited] (SOLR-3876) Solr Admin UI is completely dysfunctional on IE 9
[ https://issues.apache.org/jira/browse/SOLR-3876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13461860#comment-13461860 ] Christian Moen edited comment on SOLR-3876 at 9/25/12 2:54 AM: --- Thanks a lot for this, Jack. I'm afraid I don't know the overall status nor history of the 4.0 UI in IE9, but do you happen to know if this is a regression or if the UI has been generally broken for IE9 all along? To me it sounds quite important to get this fixed for 4.0 if it's a regression. I can help working some on this. was (Author: cm): Thanks a lot for this, Jack. I'm afraid I don't know the overall status nor history of the 4.0 UI in IE9, but do you happen to know if this is a regression or if the UI has been generally broken for IE9 all along? To me it sounds quite important to get this fixed for 4.0 and I can help working some on this. > Solr Admin UI is completely dysfunctional on IE 9 > - > > Key: SOLR-3876 > URL: https://issues.apache.org/jira/browse/SOLR-3876 > Project: Solr > Issue Type: Bug > Components: web gui >Affects Versions: 4.0-BETA, 4.0 > Environment: Windows 7, IE 9 >Reporter: Jack Krupansky >Priority: Critical > Fix For: 4.0 > > Attachments: screenshot-1.jpg, screenshot-2.jpg, screenshot-3.jpg > > > The Solr Admin UI is completely dysfunctional on IE 9. See attached screen > shot. I don't even see a "collection1" button. But Admin UI is working fine > in Google Chrome with same running instance of Solr. > Currently running 4.0 RC0, but problem existed with 4.0-BETA. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-3876) Solr Admin UI is completely dysfunctional on IE 9
[ https://issues.apache.org/jira/browse/SOLR-3876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13461860#comment-13461860 ] Christian Moen commented on SOLR-3876: -- Thanks a lot for this, Jack. I'm afraid I don't know the overall status nor history of the 4.0 UI in IE9, but do you happen to know if this is a regression or if the UI has been generally broken for IE9 all along? To me it sounds quite important to get this fixed for 4.0 and I can help working some on this. > Solr Admin UI is completely dysfunctional on IE 9 > - > > Key: SOLR-3876 > URL: https://issues.apache.org/jira/browse/SOLR-3876 > Project: Solr > Issue Type: Bug > Components: web gui >Affects Versions: 4.0-BETA, 4.0 > Environment: Windows 7, IE 9 >Reporter: Jack Krupansky >Priority: Critical > Fix For: 4.0 > > Attachments: screenshot-1.jpg, screenshot-2.jpg, screenshot-3.jpg > > > The Solr Admin UI is completely dysfunctional on IE 9. See attached screen > shot. I don't even see a "collection1" button. But Admin UI is working fine > in Google Chrome with same running instance of Solr. > Currently running 4.0 RC0, but problem existed with 4.0-BETA. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-4330) Add NAIST-jdic support to Kuromoji
Christian Moen created LUCENE-4330: -- Summary: Add NAIST-jdic support to Kuromoji Key: LUCENE-4330 URL: https://issues.apache.org/jira/browse/LUCENE-4330 Project: Lucene - Core Issue Type: Improvement Components: modules/analysis Affects Versions: 5.0, 4.0 Reporter: Christian Moen We should look into adding NAIST-jdic support to Kuromoji as this dictionary is better than the current IPADIC. The NAIST-jdic license seems fine, but needs a formal check-off before any inclusion in Lucene. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji
[ https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13425488#comment-13425488 ] Christian Moen commented on LUCENE-3922: I've attached a work-in-progress patch for {{trunk}} that implements a {{CharFilter}} that normalizes Japanese numbers. These are some TODOs and implementation considerations that I'd be thankful to get feedback on: * Buffering the entire input on the first read should be avoided. The primary reason this is done is that I was thinking of adding some regexps before and after kanji numeric strings to qualify their normalization, e.g. to only normalize strings that start with ¥ or JPY, or end with 円, so that only monetary amounts in Japanese yen are normalized. However, this probably isn't necessary as we can probably use {{Matcher.requireEnd()}} and {{Matcher.hitEnd()}} to decide if we need to read more input. (Thanks, Robert!) * Is qualifying the numbers to be normalized with prefix and suffix regexps useful, e.g. to only normalize monetary amounts? * How do we deal with leading zeros? Currently, "007" and "◯◯七" both become "7". Do we want an option to preserve leading zeros? * How large numbers do we care about supporting? Some of the kanji for larger numbers are surrogate pairs, which complicates the implementation, but they're certainly possible. If we don't care about really large numbers, we can probably be fine working with {{long}} instead of {{BigInteger}}. * Formal numerals (daiji) and some other variants aren't supported, i.e. 壱, 弐, 参, etc., but they can easily be added. We can also add the obsolete variants if that's useful somehow. Are these useful? Do we want them available via an option? * Number formats such as "1億2,345万6,789" aren't supported - we don't deal with the comma today, but this can be added. The same applies to "12 345" where a space separates thousands like in French. Numbers like "2・2兆" aren't supported either, but can be added. * Only integers are supported today, so we can't parse "〇・一二三四", which becomes "0" and "1234" as separate tokens instead of "0.1234". There are probably other considerations, too, that don't immediately come to mind. Numbers are fairly complicated and feedback on direction for further implementation is most appreciated. Thanks. > Add Japanese Kanji number normalization to Kuromoji > --- > > Key: LUCENE-3922 > URL: https://issues.apache.org/jira/browse/LUCENE-3922 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/analysis >Affects Versions: 4.0-ALPHA >Reporter: Kazuaki Hiraga > Labels: features > Attachments: LUCENE-3922.patch > > > Japanese people use Kanji numerals instead of Arabic numerals for writing > price, address and so on. i.e 12万4800円(124,800JPY), 二番町三ノ二(3-2 Nibancho) and > 十二月(December). So, we would like to normalize those Kanji numerals to Arabic > numerals (I don't think we need to have a capability to normalize to Kanji > numerals). > -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
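To make the integer part of the discussion above concrete, here is a small standalone sketch of the positional arithmetic involved, e.g. 十二万四千八百 → 124800. It works on {{long}}, only covers plain integers up to the 兆 range, and ignores the anchoring, comma, decimal and formal-numeral questions raised above; it illustrates the arithmetic only and is not the attached patch.
{code:java}
import java.util.HashMap;
import java.util.Map;

/**
 * Standalone sketch of kanji-to-Arabic integer conversion, e.g.
 * "十二万四千八百" -> 124800. Long-based, so it only covers values up to
 * the 兆 (10^12) range; decimals, commas and daiji are ignored here.
 */
public class KanjiNumberSketch {

  private static final Map<Character, Integer> DIGITS = new HashMap<Character, Integer>();
  private static final Map<Character, Long> SMALL = new HashMap<Character, Long>();  // 十, 百, 千
  private static final Map<Character, Long> LARGE = new HashMap<Character, Long>();  // 万, 億, 兆

  static {
    String digits = "〇一二三四五六七八九";
    for (int i = 0; i < digits.length(); i++) {
      DIGITS.put(digits.charAt(i), i);
    }
    SMALL.put('十', 10L);
    SMALL.put('百', 100L);
    SMALL.put('千', 1000L);
    LARGE.put('万', 10000L);
    LARGE.put('億', 100000000L);
    LARGE.put('兆', 1000000000000L);
  }

  public static long parse(String kanji) {
    long total = 0;    // completed 万/億/兆 groups
    long group = 0;    // current group below 万
    long current = 0;  // digits waiting for a 十/百/千 multiplier
    for (int i = 0; i < kanji.length(); i++) {
      char c = kanji.charAt(i);
      if (DIGITS.containsKey(c)) {
        // Digit runs such as 一二三 accumulate positionally; note that
        // leading zeros collapse, so 〇〇七 parses to 7 as described above.
        current = current * 10 + DIGITS.get(c);
      } else if (SMALL.containsKey(c)) {
        // A bare 十/百/千 counts as one unit, e.g. 十二 -> 12.
        group += (current == 0 ? 1 : current) * SMALL.get(c);
        current = 0;
      } else if (LARGE.containsKey(c)) {
        total += (group + current) * LARGE.get(c);
        group = 0;
        current = 0;
      } else {
        throw new IllegalArgumentException("unexpected character: " + c);
      }
    }
    return total + group + current;
  }

  public static void main(String[] args) {
    System.out.println(parse("十二万四千八百"));  // 124800
    System.out.println(parse("一億二千三百万"));  // 123000000
  }
}
{code}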
[jira] [Updated] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji
[ https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christian Moen updated LUCENE-3922: --- Attachment: LUCENE-3922.patch > Add Japanese Kanji number normalization to Kuromoji > --- > > Key: LUCENE-3922 > URL: https://issues.apache.org/jira/browse/LUCENE-3922 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/analysis >Affects Versions: 4.0-ALPHA >Reporter: Kazuaki Hiraga > Labels: features > Attachments: LUCENE-3922.patch > > > Japanese people use Kanji numerals instead of Arabic numerals for writing > price, address and so on. i.e 12万4800円(124,800JPY), 二番町三ノ二(3-2 Nibancho) and > 十二月(December). So, we would like to normalize those Kanji numerals to Arabic > numerals (I don't think we need to have a capability to normalize to Kanji > numerals). > -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-3524) Make discard-punctuation feature in Kuromoji configurable from JapaneseTokenizerFactory
[ https://issues.apache.org/jira/browse/SOLR-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13412685#comment-13412685 ] Christian Moen commented on SOLR-3524: -- {{CHANGES.txt}} for some reason didn't make it into {{branch_4x}}. Fixed this in revision 1360622. > Make discard-punctuation feature in Kuromoji configurable from > JapaneseTokenizerFactory > --- > > Key: SOLR-3524 > URL: https://issues.apache.org/jira/browse/SOLR-3524 > Project: Solr > Issue Type: Improvement > Components: Schema and Analysis >Affects Versions: 3.6 >Reporter: Kazuaki Hiraga >Assignee: Christian Moen >Priority: Minor > Fix For: 4.0, 5.0 > > Attachments: SOLR-3524.patch, SOLR-3524.patch, > kuromoji_discard_punctuation.patch.txt > > > JapaneseTokenizer, Kuromoji doesn't provide configuration option to preserve > punctuation in Japanese text, although It has a parameter to change this > behavior. JapaneseTokenizerFactory always set third parameter, which > controls this behavior, to true to remove punctuation. > I would like to have an option I can configure this behavior by fieldtype > definition in schema.xml. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Resolved] (SOLR-3524) Make discard-punctuation feature in Kuromoji configurable from JapaneseTokenizerFactory
[ https://issues.apache.org/jira/browse/SOLR-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christian Moen resolved SOLR-3524. -- Resolution: Fixed Fix Version/s: 5.0 4.0 Thanks, Kazu and Ohtani-san! > Make discard-punctuation feature in Kuromoji configurable from > JapaneseTokenizerFactory > --- > > Key: SOLR-3524 > URL: https://issues.apache.org/jira/browse/SOLR-3524 > Project: Solr > Issue Type: Improvement > Components: Schema and Analysis >Affects Versions: 3.6 >Reporter: Kazuaki Hiraga >Assignee: Christian Moen >Priority: Minor > Fix For: 4.0, 5.0 > > Attachments: SOLR-3524.patch, SOLR-3524.patch, > kuromoji_discard_punctuation.patch.txt > > > JapaneseTokenizer, Kuromoji doesn't provide configuration option to preserve > punctuation in Japanese text, although It has a parameter to change this > behavior. JapaneseTokenizerFactory always set third parameter, which > controls this behavior, to true to remove punctuation. > I would like to have an option I can configure this behavior by fieldtype > definition in schema.xml. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-3524) Make discard-punctuation feature in Kuromoji configurable from JapaneseTokenizerFactory
[ https://issues.apache.org/jira/browse/SOLR-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13412659#comment-13412659 ] Christian Moen commented on SOLR-3524: -- Committed revision 1360613 on {{branch_4x}} > Make discard-punctuation feature in Kuromoji configurable from > JapaneseTokenizerFactory > --- > > Key: SOLR-3524 > URL: https://issues.apache.org/jira/browse/SOLR-3524 > Project: Solr > Issue Type: Improvement > Components: Schema and Analysis >Affects Versions: 3.6 >Reporter: Kazuaki Hiraga >Assignee: Christian Moen >Priority: Minor > Fix For: 4.0, 5.0 > > Attachments: SOLR-3524.patch, SOLR-3524.patch, > kuromoji_discard_punctuation.patch.txt > > > JapaneseTokenizer, Kuromoji doesn't provide configuration option to preserve > punctuation in Japanese text, although It has a parameter to change this > behavior. JapaneseTokenizerFactory always set third parameter, which > controls this behavior, to true to remove punctuation. > I would like to have an option I can configure this behavior by fieldtype > definition in schema.xml. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-3524) Make discard-punctuation feature in Kuromoji configurable from JapaneseTokenizerFactory
[ https://issues.apache.org/jira/browse/SOLR-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13412628#comment-13412628 ] Christian Moen commented on SOLR-3524: -- Committed revision 1360592 on {{trunk}} > Make discard-punctuation feature in Kuromoji configurable from > JapaneseTokenizerFactory > --- > > Key: SOLR-3524 > URL: https://issues.apache.org/jira/browse/SOLR-3524 > Project: Solr > Issue Type: Improvement > Components: Schema and Analysis >Affects Versions: 3.6 >Reporter: Kazuaki Hiraga >Assignee: Christian Moen >Priority: Minor > Attachments: SOLR-3524.patch, SOLR-3524.patch, > kuromoji_discard_punctuation.patch.txt > > > JapaneseTokenizer, Kuromoji doesn't provide configuration option to preserve > punctuation in Japanese text, although It has a parameter to change this > behavior. JapaneseTokenizerFactory always set third parameter, which > controls this behavior, to true to remove punctuation. > I would like to have an option I can configure this behavior by fieldtype > definition in schema.xml. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-3524) Make discard-punctuation feature in Kuromoji configurable from JapaneseTokenizerFactory
[ https://issues.apache.org/jira/browse/SOLR-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13412627#comment-13412627 ] Christian Moen commented on SOLR-3524: -- Patch updated due to recent configuration changes. > Make discard-punctuation feature in Kuromoji configurable from > JapaneseTokenizerFactory > --- > > Key: SOLR-3524 > URL: https://issues.apache.org/jira/browse/SOLR-3524 > Project: Solr > Issue Type: Improvement > Components: Schema and Analysis >Affects Versions: 3.6 >Reporter: Kazuaki Hiraga >Assignee: Christian Moen >Priority: Minor > Attachments: SOLR-3524.patch, SOLR-3524.patch, > kuromoji_discard_punctuation.patch.txt > > > JapaneseTokenizer, Kuromoji doesn't provide configuration option to preserve > punctuation in Japanese text, although It has a parameter to change this > behavior. JapaneseTokenizerFactory always set third parameter, which > controls this behavior, to true to remove punctuation. > I would like to have an option I can configure this behavior by fieldtype > definition in schema.xml. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-3524) Make discard-punctuation feature in Kuromoji configurable from JapaneseTokenizerFactory
[ https://issues.apache.org/jira/browse/SOLR-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christian Moen updated SOLR-3524: - Attachment: SOLR-3524.patch > Make discard-punctuation feature in Kuromoji configurable from > JapaneseTokenizerFactory > --- > > Key: SOLR-3524 > URL: https://issues.apache.org/jira/browse/SOLR-3524 > Project: Solr > Issue Type: Improvement > Components: Schema and Analysis >Affects Versions: 3.6 >Reporter: Kazuaki Hiraga >Assignee: Christian Moen >Priority: Minor > Attachments: SOLR-3524.patch, SOLR-3524.patch, > kuromoji_discard_punctuation.patch.txt > > > JapaneseTokenizer, Kuromoji doesn't provide configuration option to preserve > punctuation in Japanese text, although It has a parameter to change this > behavior. JapaneseTokenizerFactory always set third parameter, which > controls this behavior, to true to remove punctuation. > I would like to have an option I can configure this behavior by fieldtype > definition in schema.xml. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4207) speed up our slowest tests
[ https://issues.apache.org/jira/browse/LUCENE-4207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13411654#comment-13411654 ] Christian Moen commented on LUCENE-4207: Thanks a lot, Dawid. I'll try this, have a look and report back. Adrien, thanks for taking the time! > speed up our slowest tests > -- > > Key: LUCENE-4207 > URL: https://issues.apache.org/jira/browse/LUCENE-4207 > Project: Lucene - Java > Issue Type: Bug >Reporter: Robert Muir > > Was surprised to hear from Christian that lucene/solr tests take him 40 > minutes on a modern mac. > This is too much. Lets look at the slowest tests and make them reasonable. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org