If you are not tied to Java, see 'unac' at http://www.senga.org/.
It's old, but if nothing else you could see how it works and rewrite it
in Java.  If you do, you could donate it to the Lucene Sandbox.
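
On the name lookup itself: here is a minimal sketch, assuming Java 7 or
later (where Character.getName exists) and using the JDK's
java.text.Normalizer in place of ICU4J for the decomposition step.  The
class name LatinFolder and the method fold are made up for illustration.
It tries decomposition first and falls back to the name regex from your
message, tightened to capture a single A-Z letter:

```java
import java.text.Normalizer;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene or ICU4J.
class LatinFolder {
    // Variant of the regex from the quoted message; group 1 is the case,
    // group 2 is the base letter as it appears in the Unicode name.
    private static final Pattern NAME =
        Pattern.compile("^LATIN (SMALL|CAPITAL) LETTER ([A-Z]) WITH .*$");

    /** Returns a plain A-Z/a-z letter for the code point if one can be found,
     *  otherwise the code point unchanged, as a String. */
    static String fold(int codePoint) {
        String original = new String(Character.toChars(codePoint));

        // Step 1: canonical decomposition, then drop anything that is not
        // a letter or digit (this removes the combining marks).
        String decomposed = Normalizer.normalize(original, Normalizer.Form.NFD);
        StringBuilder kept = new StringBuilder();
        for (int i = 0; i < decomposed.length(); ) {
            int cp = decomposed.codePointAt(i);
            if (Character.isLetterOrDigit(cp)) {
                kept.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        if (kept.length() > 0 && kept.codePointAt(0) < 128) {
            return kept.toString();
        }

        // Step 2: the character did not decompose to an ASCII base, so try
        // to read the base letter out of its Unicode name.
        String name = Character.getName(codePoint); // null if unassigned
        if (name != null) {
            Matcher m = NAME.matcher(name);
            if (m.matches()) {
                String letter = m.group(2);
                return m.group(1).equals("SMALL")
                    ? letter.toLowerCase(Locale.ROOT)
                    : letter;
            }
        }
        return original;
    }

    public static void main(String[] args) {
        System.out.println(fold(0x00E9)); // LATIN SMALL LETTER E WITH ACUTE
        System.out.println(fold(0x01A4)); // LATIN CAPITAL LETTER P WITH HOOK
    }
}
```

Characters whose names do not fit the pattern (ligatures, for example)
come back unchanged, so you would still want a small hand-made table for
those.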

Otis

--- Peter Pimley <[EMAIL PROTECTED]> wrote:

> 
> Hi everyone,
> 
> The Question:
> In Java generally, is there an easy way to get the Unicode name of a
> character?  (e.g. "LATIN SMALL LETTER A" from 'a')
> 
> 
> The Reasoning (for those who are interested):
> The documents I'm indexing have quite a lot of characters that are
> basically variations on the basic A-Z ones.  In my analysis step, I'd
> like to convert these to their closest equivalent in the basic A-Z
> set.
> 
> For some letters, this is easy.  An example is the e-acute character
> (00E9 LATIN SMALL LETTER E WITH ACUTE).  I'd like to turn that into
> plain 'e'.  I can do that by using the IBM ICU4J tools to decompose
> the single character into two: 'e' and 0301 COMBINING ACUTE ACCENT.
> Then I can strip all characters that fail Character.isLetterOrDigit.
> That works fine.
> 
> Some characters, however, do not decompose.  An example is the
> character 01A4 LATIN CAPITAL LETTER P WITH HOOK.  I'd like to replace
> that with 'P', but it does not decompose into P + something.
> 
> I'm considering taking the Unicode name for each character I
> encounter and matching it against a regex something like:
> ^LATIN .* LETTER (.) WITH .*$
> ... to try to extract the single A-Z|a-z character.
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
> 


