[jira] [Commented] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji
[ https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16694286#comment-16694286 ]

Kazuaki Hiraga commented on LUCENE-3922:

I have confirmed that some normalization issues remain: certain Kanji numerals are still normalized incorrectly. However, the implementation itself has been finished and merged into the main branch. I will therefore close this ticket and file a separate ticket to report the remaining normalization issues and send patches.

> Add Japanese Kanji number normalization to Kuromoji
> ---
>
> Key: LUCENE-3922
> URL: https://issues.apache.org/jira/browse/LUCENE-3922
> Project: Lucene - Core
> Issue Type: New Feature
> Components: modules/analysis
> Affects Versions: 4.0-ALPHA
> Reporter: Kazuaki Hiraga
> Assignee: Christian Moen
> Priority: Major
> Labels: features
> Attachments: LUCENE-3922.patch, LUCENE-3922.patch, LUCENE-3922.patch, LUCENE-3922.patch, LUCENE-3922.patch, LUCENE-3922.patch, LUCENE-3922.patch, LUCENE-3922.patch
>
> Japanese people use Kanji numerals instead of Arabic numerals for writing prices, addresses and so on, e.g. 12万4800円 (124,800 JPY), 二番町三ノ二 (3-2 Nibancho) and 十二月 (December). So, we would like to normalize those Kanji numerals to Arabic numerals (I don't think we need to have a capability to normalize to Kanji numerals).

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji
[ https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16689577#comment-16689577 ]

Mike Sokolov commented on LUCENE-3922:

+1 - this was merged ages ago (2015); would be nice to clean up the Jira so folks looking for interesting projects don't get diverted :)
[jira] [Commented] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji
[ https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16638360#comment-16638360 ]

ankush jhalani commented on LUCENE-3922:

I noticed that the changes are available in master/branch_7x ([Test]JapaneseNumberFilter[Factory].java). Should we mark this closed?
[jira] [Commented] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji
[ https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392489#comment-14392489 ]

Ramkumar Aiyengar commented on LUCENE-3922:

[~cm], I just got interested in this patch. Is there any reason this hasn't gone to branch_5x yet?

> Fix For: 5.1
[jira] [Commented] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji
[ https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14303062#comment-14303062 ]

ASF subversion and git services commented on LUCENE-3922:

Commit 1656670 from [~cm] in branch 'dev/trunk'
[ https://svn.apache.org/r1656670 ]

Added JapaneseNumberFilter (LUCENE-3922)
[jira] [Commented] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji
[ https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296379#comment-14296379 ]

Christian Moen commented on LUCENE-3922:

Please feel free to test it. Feedback is very welcome. The patch is against {{trunk}} and this should make it into 5.1.
[jira] [Commented] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji
[ https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14285565#comment-14285565 ]

Kazuaki Hiraga commented on LUCENE-3922:

[~cm], sounds great! Can I test this feature? If so, which version should I use?
[jira] [Commented] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji
[ https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173567#comment-14173567 ]

Christian Moen commented on LUCENE-3922:

Gaute and I have done testing on real-world data and we've uncovered and fixed a couple of corner-case issues. Our todo items are as follows:

# Do additional testing and possibly add additional number formats
# Document some unsupported cases in unit-tests
# Add class-level javadoc
# Add a Solr factory
[jira] [Commented] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji
[ https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14164954#comment-14164954 ]

Christian Moen commented on LUCENE-3922:

I've attached a new patch. The {{checkRandomData}} issues were caused by improper handling of token composition for graphs (bug found by [~gaute]). Tokens preceded by a token with position increment zero are left untouched, and so are stacked/synonym tokens. We'll do some more testing and add some documentation before we move forward to commit this.
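The rule described in this comment can be illustrated with a small sketch. This is only one reading of the comment, not the filter's actual logic: the class `StackedTokenSketch` and its method are hypothetical, and tokens are reduced to their position increments. A token is treated as eligible for number composition only if neither it nor its immediate neighbours are stacked at position increment zero.

```java
final class StackedTokenSketch {
    /**
     * Marks which tokens may take part in number composition, given only the
     * position increment of each token in stream order. A position increment of
     * zero means the token is stacked on the previous position (e.g. a synonym).
     */
    static boolean[] eligibleForComposition(int[] positionIncrements) {
        int n = positionIncrements.length;
        boolean[] eligible = new boolean[n];
        for (int i = 0; i < n; i++) {
            // The token itself is stacked on the previous position.
            boolean stackedHere = positionIncrements[i] == 0;
            // The token directly follows a stacked token.
            boolean stackedBefore = i > 0 && positionIncrements[i - 1] == 0;
            // Another token is stacked on top of this one.
            boolean stackedAfter = i + 1 < n && positionIncrements[i + 1] == 0;
            eligible[i] = !stackedHere && !stackedBefore && !stackedAfter;
        }
        return eligible;
    }
}
```

For the increments 1, 1, 0, 1, 1 only the first and last tokens are eligible; everything touching the stacked position is passed through unchanged.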
[jira] [Commented] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji
[ https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14085909#comment-14085909 ]

Christian Moen commented on LUCENE-3922:

Gaute and I have been doing some work on this and we have rewritten it as a {{TokenFilter}}. A few comments:

* We have added support for numbers such as 3.2兆円, as you requested, Kazu.
* We could potentially use a POS-tag attribute from Kuromoji to identify the numbers we are composing, but not relying on POS tags also makes this filter useful in the case of n-gramming.
* We haven't implemented any of the anchoring logic discussed above, i.e. whether to restrict normalization to prices, etc. Is this useful to have?
* Input such as {{1,5}} becomes {{15}} after normalization, which could be undesired. Is this bad input, or do we want anchoring to retain these numbers?

One thing, though: in order to support some of this number parsing, i.e. cases such as 3.2兆円, we need to use Kuromoji in a mode that retains punctuation characters. There's also an unresolved issue found by {{checkRandomData}} that we haven't tracked down and fixed yet.

This is a work in progress and feedback is welcome.
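Two of the points above, scaling a decimal such as 3.2兆 and the {{1,5}} to {{15}} effect, can be sketched in a few lines. This is a minimal illustration and not the filter's code: the class `DecimalUnitSketch` and both method names are hypothetical, and the input is assumed to be already split into a decimal string and a unit character.

```java
import java.math.BigDecimal;

final class DecimalUnitSketch {
    /**
     * Scales a decimal such as "3.2" by a large Kanji unit and returns plain digits.
     * Units are written as Unicode escapes so the source compiles under any encoding.
     */
    static String scale(String decimal, char unit) {
        int exponent;
        switch (unit) {
            case '\u4e07': exponent = 4; break;   // 万 = 10^4
            case '\u5104': exponent = 8; break;   // 億 = 10^8
            case '\u5146': exponent = 12; break;  // 兆 = 10^12
            default: throw new IllegalArgumentException("unsupported unit: " + unit);
        }
        BigDecimal value = new BigDecimal(decimal).scaleByPowerOfTen(exponent);
        // Drop a trailing ".0" and avoid scientific notation in the output.
        return value.stripTrailingZeros().toPlainString();
    }

    /** Drops commas unconditionally, which is what turns "1,5" into "15". */
    static String dropCommas(String number) {
        return number.replace(",", "");
    }
}
```

With these definitions, `scale("3.2", '兆')` yields `3200000000000`, while `dropCommas` treats `1,5` and `2,345` alike, which is why anchoring or stricter grouping checks might be wanted.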
[jira] [Commented] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji
[ https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13475016#comment-13475016 ]

Kazuaki Hiraga commented on LUCENE-3922:

It would be nice if we could choose whether to expand them or normalize them. My concern is that Solr's query-side synonym expansion doesn't work well if the number of tokens differs between the original tokens and the synonym tokens; phrase matching with query-side synonym expansion, in particular, will be a disaster. (Of course, reduction or index-side expansion would be better, but we sometimes need to use a TokenFilter that provides such a capability on the query side.) So, I would like to be able to configure whether Kanji numerals are normalized to Arabic numerals or Arabic numerals are stored along with the Kanji numerals.
[jira] [Commented] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji
[ https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13474287#comment-13474287 ]

Christian Moen commented on LUCENE-3922:

Ohtani-san, I saw your tweet about this earlier and it sounds like a very good idea. Thanks. I will try to set aside some time to work on this.
[jira] [Commented] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji
[ https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13474268#comment-13474268 ]

Jun Ohtani commented on LUCENE-3922:

Hi Christian, Kazuaki,

+1 for a TokenFilter implementation. I also think it would be helpful if this TokenFilter could expand a token into both its Arabic and Kanji number forms, like a synonym filter.
[jira] [Commented] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji
[ https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13474257#comment-13474257 ]

Kazuaki Hiraga commented on LUCENE-3922:

Hi Christian,

That is what I was thinking. I think a TokenFilter would be a good choice for implementing that feature. We can use the POS tag to recognize what a token is, and apply normalization if a token is a numeral prefix/suffix accompanied by numerals.
[jira] [Commented] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji
[ https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13474224#comment-13474224 ]

Christian Moen commented on LUCENE-3922:

Thanks, Kazu. I'm aware of the issue and the thinking is to rework this as a {{TokenFilter}} and use anchoring options with surrounding tokens to decide if normalisation should take place, i.e. if the preceding token is ¥ or the following token is 円 in the case of normalising prices.

It might also be helpful to look into using POS-info for this to benefit from what we actually know about the token, i.e. to not apply normalisation if the POS tag is a person name.

Other suggestions and ideas are of course most welcome.
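The anchoring idea for prices can be sketched as a simple check on neighbouring tokens. This is an illustration under stated assumptions rather than a proposed implementation: the class `AnchoringSketch` and its method are hypothetical, tokens are plain strings, and the yen sign is assumed to be the half-width ¥ (U+00A5).

```java
final class AnchoringSketch {
    // Anchors are written as Unicode escapes so the source compiles under any encoding.
    private static final String YEN_SIGN = "\u00a5";   // ¥
    private static final String YEN_KANJI = "\u5186";  // 円

    /**
     * Returns true if the token at index i should be normalized as a price,
     * i.e. if it is preceded by the yen sign or followed by the yen kanji.
     */
    static boolean isPriceAnchored(String[] tokens, int i) {
        boolean anchoredBefore = i > 0 && tokens[i - 1].equals(YEN_SIGN);
        boolean anchoredAfter = i + 1 < tokens.length && tokens[i + 1].equals(YEN_KANJI);
        return anchoredBefore || anchoredAfter;
    }
}
```

Under this check the 一 in 鈴木 / 一 / 郎 has no price anchor on either side and would be left alone, while 三千 followed by 円 would be normalized.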
[jira] [Commented] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji
[ https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13474210#comment-13474210 ]

Kazuaki Hiraga commented on LUCENE-3922:

The following examples are false positive cases:

"姿三四郎" became "姿", "34", "郎"
"小林一茶" became "小林", "1", "茶"
"鈴木一郎" became "鈴木", "1", "郎"

Can we prevent this behavior?
[jira] [Commented] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji
[ https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13471132#comment-13471132 ]

Christian Moen commented on LUCENE-3922:

{quote}
Is it difficult to support numbers with a period, such as the following?
3.2兆円
5.2億円
{quote}

Supporting this is no problem and a good idea.

{quote}
I think it would be helpful if this charfilter supported old Kanji numeric characters ("KYU-KANJI" or "DAIJI") such as 壱, 壹, 弌 (one), 弐, 貳, 弍 (two), 参, 參 (three), or made this configurable.
{quote}

This is also easy to support.

As for making preserving zeros configurable, that's also possible, of course.

It's great to get more feedback on what sort of functionality we need and what should be configurable options. Hopefully, we can find a good balance without adding too much complexity.

Thanks for the feedback.
[jira] [Commented] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji
[ https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13471123#comment-13471123 ]

Kazuaki Hiraga commented on LUCENE-3922:

Lance, you may be right. Although I have never seen Japanese people use Kanji numbers for James Bond movies :-), I can't say that we never use Kanji for that kind of expression.

Christian, is it possible to choose whether or not to preserve leading zeros?
[jira] [Commented] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji
[ https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13471117#comment-13471117 ]

Lance Norskog commented on LUCENE-3922:

bq. On the other hand, I agree with Christian about not preserving leading zeros. So, "◯◯七" doesn't need to become "007".

This example shows why leading zeros should be preserved :) There are different kinds of text search. Searching for media titles like James Bond movies is a very different thing from searching newspaper articles. You might want to find "◯◯七" as the Japanese-language release and "007" as the English-language release. These numbers are brands, not numbers.
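The two behaviours under discussion can be put side by side in a small sketch. It is only an illustration: the class `LeadingZeroSketch`, its method, and the `preserveLeadingZeros` flag are hypothetical names, and the input is assumed to consist solely of Kanji or ASCII digits, with the large circle used in this thread accepted as zero.

```java
final class LeadingZeroSketch {
    // Kanji digits 〇一二三四五六七八九; a character's index is its numeric value.
    // Written as Unicode escapes so the source compiles under any encoding.
    private static final String KANJI_DIGITS =
        "\u3007\u4e00\u4e8c\u4e09\u56db\u4e94\u516d\u4e03\u516b\u4e5d";

    /** Converts a digit-only string to ASCII digits, optionally keeping leading zeros. */
    static String toArabicDigits(String text, boolean preserveLeadingZeros) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\u25ef') {
                c = '\u3007';  // treat the large circle ◯ as the ideographic zero 〇
            }
            int value = (c >= '0' && c <= '9') ? c - '0' : KANJI_DIGITS.indexOf(c);
            if (value < 0) {
                throw new IllegalArgumentException("not a digit: " + c);
            }
            out.append((char) ('0' + value));
        }
        if (preserveLeadingZeros) {
            return out.toString();
        }
        // Strip leading zeros but always keep the final digit.
        int start = 0;
        while (start < out.length() - 1 && out.charAt(start) == '0') {
            start++;
        }
        return out.substring(start);
    }
}
```

With the flag set, "◯◯七" maps to "007" and matches the English-language title; without it, both "◯◯七" and "007" collapse to "7".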
[jira] [Commented] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji
[ https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13471068#comment-13471068 ]

Kazuaki Hiraga commented on LUCENE-3922:

Sorry for the late reply. Although I have some requests for improvements, this is a very helpful and nice charfilter for me. Thank you, Christian!

My requests are the following:

Is it difficult to support numbers with a period, such as the following?
3.2兆円
5.2億円

On the other hand, I agree with Christian about not preserving leading zeros. So, "◯◯七" doesn't need to become "007".

I think it would be helpful if this charfilter supported old Kanji numeric characters ("KYU-KANJI" or "DAIJI") such as 壱, 壹, 弌 (one), 弐, 貳, 弍 (two), 参, 參 (three), or made this configurable.
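Support for the old and formal numerals could be as simple as folding them to the standard digits before any other processing. The sketch below assumes exactly that approach; the class `DaijiSketch` and its method are hypothetical, and the table covers only the characters named in this comment, taking 弌 as a variant of 一 and 弍 as a variant of 二.

```java
final class DaijiSketch {
    // Formal and old variants 壱 壹 弌 弐 貳 弍 参 參 ...
    // (written as Unicode escapes so the source compiles under any encoding)
    private static final String VARIANTS =
        "\u58f1\u58f9\u5f0c\u5f10\u8cb3\u5f0d\u53c2\u53c3";
    // ... and, at the same index, the standard digit for each: 一 一 一 二 二 二 三 三.
    private static final String STANDARD =
        "\u4e00\u4e00\u4e00\u4e8c\u4e8c\u4e8c\u4e09\u4e09";

    /** Replaces each known variant with its standard Kanji digit; other characters pass through. */
    static String toStandardDigits(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            int index = VARIANTS.indexOf(c);
            out.append(index >= 0 ? STANDARD.charAt(index) : c);
        }
        return out.toString();
    }
}
```

Folding first keeps the number parser itself unchanged, and a configuration option would then only need to switch this step on or off.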
[jira] [Commented] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji
[ https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13469936#comment-13469936 ]

Lance Norskog commented on LUCENE-3922:

Kazuaki, do you have any comments on this fix?
[jira] [Commented] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji
[ https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13426340#comment-13426340 ] Kazuaki Hiraga commented on LUCENE-3922: Hi Christian, Great! I will test your patch and get back to you!! Thanks, Kazu
[jira] [Commented] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji
[ https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13425488#comment-13425488 ] Christian Moen commented on LUCENE-3922:

I've attached a work-in-progress patch for {{trunk}} that implements a {{CharFilter}} that normalizes Japanese numbers. These are some TODOs and implementation considerations that I'd be thankful to get feedback on:

* Buffering the entire input on the first read should be avoided. The primary reason it is done is that I was thinking of adding regexps before and after kanji numeric strings to qualify their normalization, e.g. to only normalize strings that start with ¥ or JPY or end with 円, so that only monetary amounts in Japanese yen are normalized. However, this probably isn't necessary, as we can probably use {{Matcher.requireEnd()}} and {{Matcher.hitEnd()}} to decide if we need to read more input. (Thanks, Robert!)
* Is qualifying the numbers to be normalized with prefix and suffix regexps useful, e.g. to only normalize monetary amounts?
* How do we deal with leading zeros? Currently, "007" and "◯◯七" both become "7". Do we want an option to preserve leading zeros?
* How large are the numbers we care about supporting? Some of the kanji for the larger numbers are encoded as surrogate pairs, which complicates the implementation, but they're certainly possible to support. If we don't care about really large numbers, we can probably be fine working with {{long}} instead of {{BigInteger}}.
* Polite numbers and some other variants aren't supported, e.g. 壱, 弐, 参, etc., but they can easily be added. We can also add the obsolete variants if that's useful somehow. Are these useful? Do we want them available via an option?
* Number formats such as "1億2,345万6,789" aren't supported, since we don't deal with the comma today, but this can be added. The same applies to "12 345", where a space separates thousands as in French. Numbers like "2・2兆" aren't supported either, but can be added.
* Only integers are supported today, so we can't parse "〇・一二三四", which becomes "0" and "1234" as separate tokens instead of "0.1234".

There are probably other considerations, too, that don't immediately come to mind. Numbers are fairly complicated, and feedback on the direction for further implementation is most appreciated. Thanks.
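To make the integer case discussed above concrete, here is a minimal Java sketch of one way to evaluate mixed kanji and Arabic numerals such as 12万4800 or 四十七. It is not the code in the attached patch: the class name `KanjiIntegerSketch` and its methods are hypothetical, only the units 十, 百, 千, 万, 億 and 兆 are handled, and commas, spaces, decimals and formal variants are deliberately left out. It uses {{BigInteger}} throughout, which sidesteps the {{long}} question at some cost in speed. Kanji are written as Unicode escapes so the file compiles under any source encoding.

```java
import java.math.BigInteger;

// Hypothetical sketch for illustration; not the code in LUCENE-3922.patch.
class KanjiIntegerSketch {
    // Kanji digits zero to nine (U+3007, then U+4E00 and so on); the index is the value.
    private static final String KANJI_DIGITS =
            "\u3007\u4e00\u4e8c\u4e09\u56db\u4e94\u516d\u4e03\u516b\u4e5d";

    /** Value of an ASCII, full-width or kanji digit, or -1 if c is not a digit. */
    static int digitValue(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= '\uff10' && c <= '\uff19') return c - '\uff10';
        return KANJI_DIGITS.indexOf(c);
    }

    /** Multiplier for ten, hundred or thousand; 0 if c is not such a unit. */
    static long smallUnit(char c) {
        switch (c) {
            case '\u5341': return 10L;
            case '\u767e': return 100L;
            case '\u5343': return 1_000L;
            default: return 0L;
        }
    }

    /** Multiplier for man (10^4), oku (10^8) or cho (10^12); 0 otherwise. */
    static long largeUnit(char c) {
        switch (c) {
            case '\u4e07': return 10_000L;
            case '\u5104': return 100_000_000L;
            case '\u5146': return 1_000_000_000_000L;
            default: return 0L;
        }
    }

    static BigInteger parse(String s) {
        BigInteger total = BigInteger.ZERO;   // groups already closed by a large unit
        BigInteger section = BigInteger.ZERO; // value built from ten/hundred/thousand
        BigInteger current = BigInteger.ZERO; // run of plain digits, read positionally
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            int digit = digitValue(c);
            long small = smallUnit(c);
            long large = largeUnit(c);
            if (digit >= 0) {
                current = current.multiply(BigInteger.TEN).add(BigInteger.valueOf(digit));
            } else if (small > 0) {
                // A bare unit such as "ten" counts as one times the unit.
                BigInteger factor = current.signum() == 0 ? BigInteger.ONE : current;
                section = section.add(factor.multiply(BigInteger.valueOf(small)));
                current = BigInteger.ZERO;
            } else if (large > 0) {
                BigInteger group = section.add(current);
                if (group.signum() == 0) group = BigInteger.ONE;
                total = total.add(group.multiply(BigInteger.valueOf(large)));
                section = BigInteger.ZERO;
                current = BigInteger.ZERO;
            } else {
                throw new IllegalArgumentException("Not a numeral: " + c);
            }
        }
        return total.add(section).add(current);
    }

    public static void main(String[] args) {
        System.out.println(parse("12\u4e074800"));       // 12 man 4800 -> 124800
        System.out.println(parse("\u56db\u5341\u4e03")); // four ten seven -> 47
        System.out.println(parse("\u3007\u3007\u4e03")); // zero zero seven -> 7
    }
}
```

Because the result is a number, leading zeros disappear on their own, which matches the behaviour described for "007". Preserving them would require carrying the original digit string alongside the value.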
[jira] [Commented] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji
[ https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239122#comment-13239122 ] Kazuaki Hiraga commented on LUCENE-3922: Koji, thank you for your comment. I am very interested in the normalizer you mentioned. Is it possible to choose whether a suffix or prefix (年/月/円, etc.) is concatenated to the Arabic numbers?
[jira] [Commented] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji
[ https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13238334#comment-13238334 ] Christian Moen commented on LUCENE-3922: Koji, this is very nice. Does the kanji number normalizer ({{KanjiNumberCharFilter}}) also deal with combinations of kanji and Arabic numbers, like Kazu's price example? Is the code you refer to above something that can go into Lucene, or is it non-free software?
[jira] [Commented] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji
[ https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13238329#comment-13238329 ] Koji Sekiguchi commented on LUCENE-3922:

We, RONDHUIT, have done this kind of normalization (and more!). You may be interested in: http://www.rondhuit-demo.com/RCSS/api/overview-summary.html#featured-japanese

||Summary||Normalization sample||
|Kanji numeral => Arabic numeral normalization (漢数字=>算用数字正規化)|四七=>47, 四十七=>47, 四拾七=>47, 四〇七=>407|
|Japanese era => Gregorian calendar normalization (和暦=>西暦正規化)|昭和四七年、昭和四十七年、昭和四拾七年=>1972年, 昭和六十四年、平成元年=>1989年|
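The era conversion in the sample table above comes down to a fixed offset per era: Shōwa year n is 1925 + n and Heisei year n is 1988 + n, with 元年 meaning year 1. The following Java sketch shows only that arithmetic. It is not RONDHUIT's implementation, which is not shown in this thread; the class name `EraYearSketch` and its method are hypothetical, only the two eras from the table are covered, and the year is assumed to have been normalized to ASCII digits already (or to be the single character 元). Kanji are written as Unicode escapes so the file compiles under any source encoding.

```java
// Hypothetical sketch for illustration; not RONDHUIT's normalizer.
class EraYearSketch {
    static final String SHOWA = "\u662d\u548c";  // era name "Showa"
    static final String HEISEI = "\u5e73\u6210"; // era name "Heisei"
    static final String GANNEN = "\u5143";       // "first year" marker

    /**
     * Converts an era name plus a year-of-era to a Gregorian year.
     * yearOfEra is either the first-year marker or ASCII digits.
     */
    static int toGregorian(String era, String yearOfEra) {
        int n = GANNEN.equals(yearOfEra) ? 1 : Integer.parseInt(yearOfEra);
        if (n < 1) {
            throw new IllegalArgumentException("Year of era must be >= 1: " + n);
        }
        if (SHOWA.equals(era)) {
            return 1925 + n; // Showa 1 = 1926
        }
        if (HEISEI.equals(era)) {
            return 1988 + n; // Heisei 1 = 1989
        }
        throw new IllegalArgumentException("Unknown era: " + era);
    }

    public static void main(String[] args) {
        System.out.println(toGregorian(SHOWA, "47"));    // 1972
        System.out.println(toGregorian(SHOWA, "64"));    // 1989
        System.out.println(toGregorian(HEISEI, GANNEN)); // 1989
    }
}
```

Shōwa 64 and Heisei 1 both map to 1989 because the era changed within that year, as the last row of the table shows. A year-level conversion therefore cannot tell the two apart after normalization.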
[jira] [Commented] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji
[ https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13238303#comment-13238303 ] Christian Moen commented on LUCENE-3922: Thanks a lot, Kazu. This is a good idea to add. Patches are of course also very welcome! :)