[ 
https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14085909#comment-14085909
 ] 

Christian Moen commented on LUCENE-3922:
----------------------------------------

Gaute and I have been working on this, and we have rewritten it 
as a {{TokenFilter}}.

A few comments:

* We have added support for numbers such as 3.2兆円 as you requested, Kazu.
* We could potentially use a POS-tag attribute from Kuromoji to identify the 
numbers that we are composing, but not relying on POS tags perhaps makes this 
filter useful for n-gramming as well.
* We haven't implemented any of the anchoring logic discussed above, i.e. 
restricting normalization to prices, etc. Is this useful to have?
* Input such as {{1,5}} becomes {{15}} after normalization, which could be 
undesirable. Is this bad input, or do we want anchoring to retain these numbers?
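
To make the {{1,5}} question concrete, here is a minimal, self-contained Java 
sketch of the arithmetic involved. It is not the code from the attached patch: 
the class {{KanjiNumberSketch}} and its methods are hypothetical, it works on a 
plain string rather than on a token stream, and it assumes that a currency 
suffix such as 円 arrives as a separate token. Commas are treated as thousands 
separators and dropped, which is why {{1,5}} collapses to {{15}}. Digits 
(ASCII or 〇 to 九) are collected into a run, 十/百/千 scale the run within a 
group, and 万/億/兆 close a group.

```java
import java.math.BigDecimal;

// Hypothetical sketch, not the code from the attached patch.
// Kanji are written as Unicode escapes so the file compiles under any source encoding.
class KanjiNumberSketch {

    // Kanji digits zero through nine (U+3007, U+4E00, U+4E8C, ...), indexed by value.
    private static final String KANJI_DIGITS =
        "\u3007\u4e00\u4e8c\u4e09\u56db\u4e94\u516d\u4e03\u516b\u4e5d";

    // Returns 0..9 for an ASCII or Kanji digit, -1 for anything else.
    static int digitValue(char c) {
        if (c >= '0' && c <= '9') {
            return c - '0';
        }
        return KANJI_DIGITS.indexOf(c);
    }

    // Returns the power of ten for a multiplier character, -1 for anything else.
    static int exponent(char c) {
        switch (c) {
            case '\u5341': return 1;   // U+5341, ten
            case '\u767e': return 2;   // U+767E, hundred
            case '\u5343': return 3;   // U+5343, thousand
            case '\u4e07': return 4;   // U+4E07, ten thousand
            case '\u5104': return 8;   // U+5104, hundred million
            case '\u5146': return 12;  // U+5146, trillion
            default: return -1;
        }
    }

    // Parses and clears the pending digit run; an empty run counts as zero.
    static BigDecimal takePending(StringBuilder literal) {
        if (literal.length() == 0) {
            return BigDecimal.ZERO;
        }
        BigDecimal value = new BigDecimal(literal.toString());
        literal.setLength(0);
        return value;
    }

    // Normalizes a numeric string to plain Arabic digits.
    static String normalize(String input) {
        BigDecimal total = BigDecimal.ZERO;  // sum of completed large groups
        BigDecimal group = BigDecimal.ZERO;  // value below the next large multiplier
        StringBuilder literal = new StringBuilder();  // pending digits, may contain '.'
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            int digit = digitValue(c);
            int exp = exponent(c);
            if (c == ',') {
                continue;  // assumed to be a thousands separator, so "1,5" becomes "15"
            } else if (c == '.') {
                literal.append('.');
            } else if (digit >= 0) {
                literal.append((char) ('0' + digit));
            } else if (exp >= 1 && exp <= 3) {
                // A bare small multiplier means one times that power of ten.
                BigDecimal n = literal.length() == 0 ? BigDecimal.ONE : takePending(literal);
                group = group.add(n.scaleByPowerOfTen(exp));
            } else if (exp >= 4) {
                group = group.add(takePending(literal));
                total = total.add(group.scaleByPowerOfTen(exp));
                group = BigDecimal.ZERO;
            } else {
                throw new IllegalArgumentException("Not part of a number: " + c);
            }
        }
        total = total.add(group).add(takePending(literal));
        return total.stripTrailingZeros().toPlainString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("1,5"));           // prints 15
        System.out.println(normalize("3.2\u5146"));     // 3.2 trillion, prints 3200000000000
        System.out.println(normalize("12\u4e074800"));  // prints 124800
    }
}
```

With anchoring, a filter could decline to call something like {{normalize}} 
unless the run is followed by a unit such as 円, which would leave {{1,5}} 
untouched. Whether that is worth the extra complexity is the open question 
above.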

One caveat: to support some of this number parsing, e.g. cases such as 
3.2兆円, we need to use Kuromoji in a mode that retains punctuation 
characters.
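
The reason is easiest to see with a toy example. The sketch below is 
hypothetical and does not use Lucene at all. It assumes, for illustration 
only, that the tokenizer emits {{3}}, {{.}}, {{2}} and 兆 as separate tokens, 
and it composes the tokens up to 兆 into one number. If the {{.}} token has 
already been discarded upstream (Kuromoji's {{JapaneseTokenizer}} constructor 
takes a {{discardPunctuation}} flag for this), the filter can no longer tell 
3.2兆 from 32兆.

```java
import java.math.BigDecimal;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; the token splits are assumed, not taken from Kuromoji output.
class PunctuationSketch {

    // Concatenates the tokens before U+5146 (trillion) and scales the result by 10^12.
    static String compose(List<String> tokens) {
        StringBuilder literal = new StringBuilder();
        int exponent = 0;
        for (String token : tokens) {
            if (token.equals("\u5146")) {
                exponent = 12;
                break;
            }
            literal.append(token);
        }
        return new BigDecimal(literal.toString()).scaleByPowerOfTen(exponent).toPlainString();
    }

    public static void main(String[] args) {
        // Punctuation retained: the decimal point survives.
        System.out.println(compose(Arrays.asList("3", ".", "2", "\u5146")));  // prints 3200000000000
        // Punctuation discarded: the value is off by a factor of ten.
        System.out.println(compose(Arrays.asList("3", "2", "\u5146")));       // prints 32000000000000
    }
}
```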

There's also an unresolved issue found by {{checkRandomData}} that we haven't 
tracked down and fixed yet.

This is a work in progress and feedback is welcome.

> Add Japanese Kanji number normalization to Kuromoji
> ---------------------------------------------------
>
>                 Key: LUCENE-3922
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3922
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/analysis
>    Affects Versions: 4.0-ALPHA
>            Reporter: Kazuaki Hiraga
>              Labels: features
>         Attachments: LUCENE-3922.patch, LUCENE-3922.patch
>
>
> Japanese people use Kanji numerals instead of Arabic numerals for writing 
> prices, addresses and so on, e.g. 12万4800円 (124,800 JPY), 二番町三ノ二 (3-2 Nibancho) and 
> 十二月 (December). So, we would like to normalize those Kanji numerals to Arabic 
> numerals (I don't think we need to have a capability to normalize to Kanji 
> numerals).
>  



