[ 
https://issues.apache.org/jira/browse/SOLR-815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12641071#action_12641071
 ] 

Walter Underwood commented on SOLR-815:
---------------------------------------

I looked it up, and even found a reason to do it the right way.

Latin should be normalized to halfwidth (in the Latin-1 character space).

Kana should be normalized to fullwidth.

Normalizing Latin characters to fullwidth would mean you could not use the 
existing accent-stripping filters or probably any other filter that expected 
Latin-1, like synonyms. Normalizing to halfwidth makes the rest of Solr and 
Lucene work as expected.

See section 12.5: http://www.unicode.org/versions/Unicode5.0.0/ch12.pdf

The compatibility forms (the ones we normalize away from) are in the Unicode 
range U+FF00 to U+FFEF.
The correct mappings from those forms are in this doc: 
http://www.unicode.org/charts/PDF/UFF00.pdf

Other charts are here: http://www.unicode.org/charts/
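
A rough sketch of that direction of normalization, in case it helps the 
discussion. This is not the code in the attached patch; the class and method 
names are made up for illustration. It assumes the fullwidth ASCII variants 
(U+FF01 to U+FF5E) sit at a fixed offset of 0xFEE0 above their ASCII 
counterparts, and it leans on java.text.Normalizer (NFKC) to fold runs of 
halfwidth Katakana (U+FF65 to U+FF9F) to fullwidth, including composing a 
halfwidth voiced sound mark with the preceding kana.

```java
import java.text.Normalizer;

// Hypothetical helper for illustration only; not part of Solr or Lucene.
final class WidthNormalizerSketch {
    // Fullwidth ASCII variants: U+FF01..U+FF5E map to U+0021..U+007E.
    private static final int FULLWIDTH_START = 0xFF01;
    private static final int FULLWIDTH_END = 0xFF5E;
    private static final int FULLWIDTH_OFFSET = 0xFEE0;

    // Halfwidth Katakana and related marks: U+FF65..U+FF9F.
    private static final int HALFWIDTH_KANA_START = 0xFF65;
    private static final int HALFWIDTH_KANA_END = 0xFF9F;

    private WidthNormalizerSketch() {}

    private static boolean isHalfwidthKana(char c) {
        return c >= HALFWIDTH_KANA_START && c <= HALFWIDTH_KANA_END;
    }

    // Latin goes to halfwidth (plain ASCII), Katakana goes to fullwidth.
    static String normalizeWidth(String input) {
        StringBuilder out = new StringBuilder(input.length());
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c >= FULLWIDTH_START && c <= FULLWIDTH_END) {
                // Fullwidth Latin, digits, punctuation: subtract the offset.
                out.append((char) (c - FULLWIDTH_OFFSET));
                i++;
            } else if (isHalfwidthKana(c)) {
                // Normalize the whole run at once so that a kana followed by
                // a halfwidth voiced mark composes into a single character.
                int start = i;
                while (i < input.length() && isHalfwidthKana(input.charAt(i))) {
                    i++;
                }
                out.append(Normalizer.normalize(
                        input.substring(start, i), Normalizer.Form.NFKC));
            } else {
                // Everything else passes through unchanged.
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }
}
```

After this step the Latin text is plain ASCII, so the accent-stripping and 
synonym filters further down the chain see what they expect. A real filter 
would work on the token's term buffer rather than on Strings, and would need 
to decide what to do with the other characters in the block (for example 
U+FFE0 to U+FFEF), which this sketch leaves alone.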


> Add new Japanese half-width/full-width normalizaton Filter and Factory
> ----------------------------------------------------------------------
>
>                 Key: SOLR-815
>                 URL: https://issues.apache.org/jira/browse/SOLR-815
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>    Affects Versions: 1.3
>            Reporter: Todd Feak
>            Assignee: Koji Sekiguchi
>            Priority: Minor
>         Attachments: SOLR-815.patch
>
>
> Japanese Katakana and Latin alphabet characters exist in both "half-width" 
> and "full-width" versions. This new Filter normalizes to the full-width 
> version to allow searching and indexing using both.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
