petite_abeille wrote:

I would love to see this. I presently have a somewhat unwieldy conversion table [1] that I would love to get rid of :))
[snip]
[1] http://dev.alt.textdrive.com/browser/lu/LUStringBasicLatin.txt

I've attached the Perl script -- feed http://www.unicode.org/Public/4.1.0/ucd/UnicodeData.txt to it. It's based on a slightly different principle from yours. You seem to look for things like "mumble mumble LETTER X mumble" and take "X" as the base letter, which means that, for example, ɖ (a "d" with a hook) gets converted to "d". My script, on the other hand, also deals with things like "Ǣ" (LATIN CAPITAL LETTER AE WITH MACRON) and converts it to AE. There are some differences of opinion, though: you have ß mapped to "s" whereas I have "ss" ("straße" to "strasse" rather than "strase" seems right to me). I think I'm also over-enthusiastic when it comes to mapping characters to spaces: I know there are some Arabic characters that get mapped to spaces. For the purposes of converting to an ASCII approximation, though, I suspect a combination of your approach and mine would be best. What do you think?
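
To make the comparison concrete, here's a minimal sketch of the name-based idea (it is not the attached mkswitch.pl, just an illustration of the principle): take whatever sits between "LETTER" and "WITH" in the character name, which covers both the single-letter case and the multi-letter "AE" case, with ß special-cased to "ss".

    #!/usr/bin/perl
    # Minimal sketch of the name-based mapping (not the attached mkswitch.pl):
    #   "LATIN SMALL LETTER D WITH HOOK"      -> "d"
    #   "LATIN CAPITAL LETTER AE WITH MACRON" -> "AE"
    use strict;
    use warnings;

    my %ascii_for = ("\x{00DF}" => "ss");    # special case: sharp s -> "ss"

    while (<>) {                             # feed UnicodeData.txt on stdin
        chomp;
        my ($code, $name) = (split /;/)[0, 1];
        next unless $name =~ /^LATIN (CAPITAL|SMALL) LETTER ([A-Z]+)(?: WITH .+)?$/;
        my $base = $1 eq 'SMALL' ? lc $2 : $2;
        $ascii_for{ chr hex $code } = $base;
    }

    # dump the table as code points so the output stays pure ASCII
    printf "U+%04X => %s\n", ord $_, $ascii_for{$_} for sort keys %ascii_for;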

Of course, it's still unwieldy -- the code uses a huge switch statement. It would be more aesthetically pleasing to have a class representing UnicodeData.txt and work out the mapping on the fly. IBM have some Unicode stuff that deals with decomposition and uses a similar algorithm (I think) to the one I use. The standard java.lang.Character class has everything needed to implement in Java what I do in Perl, except the decompositions; generating a map of decompositions isn't difficult, though. However, I doubt the reduction in code size would make it run any faster, and looking at the name of each letter to determine its nearest ASCII equivalent is certainly going to be slow.
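
For what it's worth, here's a rough sketch of that "work it out on the fly" alternative -- in Perl rather than Java, and with the package and method names invented for illustration: load UnicodeData.txt into a table once at start-up and do the substitution at run time instead of generating a switch statement.

    # Rough sketch of the "work it out on the fly" idea; UnicodeToAscii and
    # to_ascii are made-up names, not an existing API.
    package UnicodeToAscii;
    use strict;
    use warnings;

    sub new {
        my ($class, $data_file) = @_;
        my %table = ("\x{00DF}" => "ss");            # sharp s -> "ss"
        open my $fh, '<', $data_file or die "cannot open $data_file: $!";
        while (<$fh>) {
            my ($code, $name) = (split /;/)[0, 1];
            next unless $name =~ /^LATIN (CAPITAL|SMALL) LETTER ([A-Z]+)(?: WITH .+)?$/;
            $table{ chr hex $code } = $1 eq 'SMALL' ? lc $2 : $2;
        }
        close $fh;
        return bless { table => \%table }, $class;
    }

    # Replace each non-ASCII character with its nearest ASCII approximation,
    # falling back to a space when nothing better is known.
    sub to_ascii {
        my ($self, $text) = @_;
        $text =~ s/([^\x00-\x7F])/ exists $self->{table}{$1} ? $self->{table}{$1} : ' ' /ge;
        return $text;
    }

    1;

    # usage:
    #   my $conv = UnicodeToAscii->new('UnicodeData.txt');
    #   print $conv->to_ascii("stra\x{00DF}e"), "\n";   # prints "strasse"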

jch

Attachment: mkswitch.pl
Description: Perl program
