Unicode Normalization

David Woodward Wed, 11 Apr 2007 13:01:30 -0700

Hi.

I have encountered a problem searching in my application because of
inconsistent Unicode normalization forms in the corpus (and in the queries). I
would like to normalize to form NFKD in an analyzer (I think). I was thinking
about creating a filter, similar to LowerCaseFilter, that would do the
Unicode normalization, and then adding that filter to my existing Snowball
analyzer. I am about to embark on creating said analyzer/filter using the ICU
(http://icu-project.org/) icu4j jar.
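
To make the mismatch concrete, here is a minimal sketch of the per-token
step such a filter would perform. It is not a Lucene filter: it assumes the
JDK's java.text.Normalizer (Java 6 and later) as a stand-in for icu4j so
that it runs without extra jars, and the class name NfkdSketch and the
method normalizeToken are hypothetical names chosen for illustration.

```java
import java.text.Normalizer;

// Sketch only: shows the normalization a custom token filter could apply
// to each term. Uses the JDK normalizer as a stand-in for icu4j.
final class NfkdSketch {

    // Hypothetical helper: normalize one term to form NFKD.
    static String normalizeToken(String term) {
        return Normalizer.normalize(term, Normalizer.Form.NFKD);
    }

    public static void main(String[] args) {
        String composed = "\u00e9";     // "e with acute" as a single code point
        String decomposed = "e\u0301";  // "e" followed by a combining acute accent

        // The raw strings differ, so an indexed term and a query term
        // in different forms would not match.
        System.out.println(composed.equals(decomposed));

        // After NFKD both are the same code point sequence.
        System.out.println(
            normalizeToken(composed).equals(normalizeToken(decomposed)));

        // NFKD also applies compatibility mappings: the "fi" ligature
        // (U+FB01) becomes the two letters "f" and "i".
        System.out.println(normalizeToken("\ufb01"));
    }
}
```

In an actual filter the same call would be applied to each token's text, at
both index time and query time, so that the corpus and the queries end up in
the same form.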


Is this already accounted for somewhere in standard Lucene, and I'm just
missing it?

Anything similar out there?

Any other advice?

Thanks,
Dave Woodward
Library of Congress

