[ https://issues.apache.org/jira/browse/SOLR-815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12641071#action_12641071 ]
Walter Underwood commented on SOLR-815:
---------------------------------------

I looked it up, and even found a reason to do it the right way.

Latin should be normalized to halfwidth (in the Latin-1 character space). Kana should be normalized to fullwidth.

Normalizing Latin characters to fullwidth would mean you could not use the existing accent-stripping filters, or probably any other filter that expects Latin-1, like synonyms. Normalizing to halfwidth makes the rest of Solr and Lucene work as expected.

See section 12.5: http://www.unicode.org/versions/Unicode5.0.0/ch12.pdf

The compatibility forms (the ones we normalize away from) are in the Unicode range U+FF00 to U+FFEF. The correct mappings from those forms are in this doc: http://www.unicode.org/charts/PDF/UFF00.pdf

Other charts are here: http://www.unicode.org/charts/

> Add new Japanese half-width/full-width normalizaton Filter and Factory
> ----------------------------------------------------------------------
>
>                 Key: SOLR-815
>                 URL: https://issues.apache.org/jira/browse/SOLR-815
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>    Affects Versions: 1.3
>            Reporter: Todd Feak
>            Assignee: Koji Sekiguchi
>            Priority: Minor
>         Attachments: SOLR-815.patch
>
>
> Japanese Katakana and Latin alphabet characters exist as both a "half-width"
> and "full-width" version. This new Filter normalizes to the full-width
> version to allow searching and indexing using both.
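
To illustrate the direction suggested in the comment (Latin to halfwidth, Kana to fullwidth), here is a minimal Python sketch. It is not the SOLR-815 patch and not a Solr or Lucene API. The function name normalize_width is hypothetical, and the sketch assumes that applying Unicode NFKC normalization only to runs of characters in U+FF00 to U+FFEF is an acceptable stand-in for the mappings listed in the UFF00 chart.

```python
import re
import unicodedata

# Runs of characters from the Halfwidth and Fullwidth Forms block.
# Matching whole runs (not single characters) lets a halfwidth kana and a
# following halfwidth voiced sound mark be composed together.
_WIDTH_FORMS = re.compile("[\uff00-\uffef]+")


def normalize_width(text: str) -> str:
    """Hypothetical helper: fold compatibility width forms to their base forms.

    Fullwidth Latin letters, digits, and punctuation become ASCII, and
    halfwidth Katakana becomes ordinary (fullwidth) Katakana. Characters
    outside U+FF00..U+FFEF are left untouched.
    """
    return _WIDTH_FORMS.sub(
        lambda m: unicodedata.normalize("NFKC", m.group(0)), text
    )


print(normalize_width("\uff33\uff4f\uff4c\uff52"))  # fullwidth "Solr" -> Solr
print(normalize_width("\uff76\uff80\uff76\uff85"))  # halfwidth kana -> カタカナ
```

Restricting NFKC to this one block is a deliberate choice in the sketch: it avoids folding unrelated compatibility characters (ligatures, superscripts, and so on) elsewhere in the text, while leaving the output in a form that filters expecting plain Latin characters can still process.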