Hallvard B Furuseth wrote:
> I need a function which converts Latin Unicode characters to 
> the closest equivalent ASCII characters, e.g. "é" -> "e".
> 
> Before I reinvent the wheel, does any public domain or GPL 
> code for this already exist?

I don't know, sorry.

> If not,
> for the most part I expect I can make the mapping from the character
> names, e.g. ignore 'WITH ACUTE' in 'LATIN CAPITAL LETTER O WITH ACUTE'
> in <ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt>.

Why the name!?

The decomposition property (6th field on each line) is much better for this.
E.g.:

        00E9;LATIN SMALL LETTER E WITH ACUTE;Ll;0;L;0065 0301;;;;N;LATIN SMALL LETTER E ACUTE;;00C9;;00C9

The decomposition field tells you that "é" (code 00E9 hex) is composed of
ASCII "e" (code 0065 hex) and the combining acute accent (code 0301 hex):
you keep the ASCII character and drop the combining accent.
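
As a rough sketch of that idea, assuming you can use Python and its standard
unicodedata module (which exposes the same decomposition data, so there is no
need to parse UnicodeData.txt by hand): decompose the text canonically, then
throw away the combining marks. The function name strip_accents is only
illustrative.

```python
import unicodedata

def strip_accents(text):
    """Decompose canonically (NFD), then drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns a non-zero class for combining marks
    # such as U+0301 COMBINING ACUTE ACCENT.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("\u00e9"))  # LATIN SMALL LETTER E WITH ACUTE
```

Note that characters without a decomposition (e.g. "ß" or "ø") pass through
unchanged, so this alone does not guarantee ASCII output.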

> Punctuation and other non-letters will be worse, but they are less
> important to me anyway.

The result is much better if you allow the ASCII conversion to be a string.
This lets you map, e.g., "©" -> "(c)", "½" -> "1/2", and so on. This is also
good for letters: "ß" -> "ss", "å" -> "aa", etc.
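
One possible way to combine the two, again only a sketch in Python: a small
hand-made table of string replacements is consulted first, and the
decomposition is used as a fallback. The table SPECIAL, the function to_ascii
and the "?" placeholder are my own assumptions; the table holds only the
examples above and would need to be extended for real use.

```python
import unicodedata

# Hypothetical hand-picked table; entries are just the examples given above.
SPECIAL = {
    "\u00a9": "(c)",   # COPYRIGHT SIGN
    "\u00bd": "1/2",   # VULGAR FRACTION ONE HALF
    "\u00df": "ss",    # LATIN SMALL LETTER SHARP S
    "\u00e5": "aa",    # LATIN SMALL LETTER A WITH RING ABOVE
}

def to_ascii(text, fallback="?"):
    """Map each character to an ASCII string: table first, then decomposition."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)                 # already ASCII
        elif ch in SPECIAL:
            out.append(SPECIAL[ch])        # explicit string replacement
        else:
            # Fall back to the canonical decomposition minus combining marks.
            base = "".join(c for c in unicodedata.normalize("NFD", ch)
                           if not unicodedata.combining(c))
            if base and all(ord(c) < 128 for c in base):
                out.append(base)
            else:
                out.append(fallback)       # nothing sensible found
    return "".join(out)
```

Checking the table before the decomposition matters: "å" decomposes to "a"
plus a combining ring, so without the table entry it would come out as "a"
rather than "aa".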

_ Marco

