If you are not tied to Java, see 'unac' at http://www.senga.org/. It's old, but if nothing else you could look at how it works and rewrite it in Java. And if you do, you could donate it to the Lucene Sandbox.
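For reference, the approach described in the quoted message (decompose, strip whatever fails Character.isLetterOrDigit, then fall back to the Unicode name) can be sketched in plain Java. This is only a sketch under assumptions: it uses java.text.Normalizer rather than ICU4J for the decomposition, Character.getName (Java 7 and later) for the name lookup, and Java 8 streams. The class name NameFolder and its methods are made up for illustration, and the regex is a tightened variant of the one in the message.

```java
import java.text.Normalizer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene or ICU4J.
final class NameFolder {
    // Variant of the regex from the message: capture the case word and a single A-Z letter.
    private static final Pattern LATIN_NAME =
        Pattern.compile("^LATIN (SMALL|CAPITAL) LETTER ([A-Z]) WITH .*$");

    // Returns the basic letter derived from the Unicode name, or -1 if the name does not fit.
    static int baseLetter(int codePoint) {
        String name = Character.getName(codePoint); // null for unassigned code points
        if (name == null) {
            return -1;
        }
        Matcher m = LATIN_NAME.matcher(name);
        if (!m.matches()) {
            return -1;
        }
        char letter = m.group(2).charAt(0);
        return m.group(1).equals("SMALL") ? Character.toLowerCase(letter) : letter;
    }

    // Folds a single token; like the approach in the message, it drops anything
    // that is not a letter or digit, so it is not meant for whole sentences.
    static String fold(String token) {
        // Step 1: canonical decomposition, e.g. U+00E9 becomes 'e' + U+0301.
        String decomposed = Normalizer.normalize(token, Normalizer.Form.NFD);
        StringBuilder out = new StringBuilder();
        decomposed.codePoints().forEach(cp -> {
            // Step 2: combining marks fail isLetterOrDigit and are stripped here.
            if (!Character.isLetterOrDigit(cp)) {
                return;
            }
            // Step 3: for non-ASCII letters that did not decompose, try the name.
            int base = cp < 0x80 ? cp : baseLetter(cp);
            out.appendCodePoint(base >= 0 ? base : cp);
        });
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(fold("caf\u00E9"));
        System.out.println(fold("\u01A4"));
    }
}
```

With this, the e-acute example folds to plain 'e' through decomposition, and U+01A4 folds to 'P' through the name lookup. Letters whose names do not fit the pattern (LATIN SMALL LETTER AE, for instance) are left unchanged, so a small hand-written mapping table would still be needed for those.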

Otis

--- Peter Pimley <[EMAIL PROTECTED]> wrote:
> Hi everyone,
>
> The Question:
> In Java generally, Is there an easy way to get the unicode name of a
> character? (e.g. "LATIN SMALL LETTER A" from 'a')
>
> The Reasoning (for those who are interested):
> The documents I'm indexing have quite a lot of characters that are
> basically variations on the basic A-Z ones. In my analysis step, I'd
> like to convert these to their closest equivalent in the basic A-Z
> set.
>
> For some letters, this is easy. An example is the e-acute character
> (00E9 LATIN SMALL LETTER E WITH ACUTE). I'd like to turn that into
> plain 'e'. I can do that by using the IBM ICU4J tools to decompose
> the single character into two; 'e' and 0301 COMBINING ACUTE ACCENT.
> Then I can strip all characters that fail Character.isLetterOrDigit.
> That works fine.
>
> Some characters however do not decompose. An example is the
> character 01A4 LATIN CAPITAL LETTER P WITH HOOK. I'd like to replace
> that with 'P', but it does not decompose into P + something.
>
> I'm considering taking the unicode name for each character I
> encounter and regexping it against something like:
> ^LATIN .* LETTER (.) WITH .*$
> ... to try and extract the single A-Z|a-z character.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]