: I have encountered a problem searching in my application because of
: inconsistent Unicode normalization forms in the corpus (and the
: queries). I would like to normalize to form NFKD in an analyzer (I
: think). I was thinking about creating a filter similar to the
I'm pretty naive about the many issues with charsets and character encodings, but isn't this a problem best solved when first constructing the Java String or Reader object -- either from a file on disk or from a network socket of some kind? Or am I misunderstanding your meaning of the word "normalization"? At first I thought you might be talking about something like the ISOLatin1AccentFilter, but then I looked at the ICU URL you mentioned and it seems to be all about byte=>character issues ... that doesn't sound like something you would really want to be doing in an Analyzer.

-Hoss

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
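For what it's worth, the NFKD normalization the original poster asks about is a character-level (not byte-level) transform, and since Java 6 it is available in the standard library via java.text.Normalizer, so it can be applied to each token's text inside a custom TokenFilter. A minimal sketch of the transform itself (the TokenFilter wiring is left out, since it depends on the Lucene version in use):

```java
import java.text.Normalizer;

public class NfkdDemo {
    // Normalize a string to NFKD (compatibility decomposition).
    // In an analyzer, one would apply this to each token's term text.
    static String nfkd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKD);
    }

    public static void main(String[] args) {
        // The "fi" ligature U+FB01 decomposes to the two letters "fi".
        System.out.println(nfkd("\uFB01"));

        // Precomposed e-acute (U+00E9) becomes "e" + combining acute (U+0301),
        // so the normalized string has two chars.
        System.out.println(nfkd("\u00E9").length());

        // Both spellings of "café" compare equal once both are normalized,
        // which is exactly what fixes the inconsistent-corpus search problem.
        System.out.println(nfkd("caf\u00E9").equals(nfkd("cafe\u0301")));
    }
}
```

Normalizing both at index time (in the analyzer chain) and at query time with the same form is what makes the two inconsistent spellings match. ICU4J provides the same normalization forms if the standard-library class is not available.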