2009/5/13 Christian Weisgerber <na...@mips.inka.de>:
>
> Being able to use it with a Unix-style filesystem is one of UTF-8's
> design principles.
>
> All ASCII characters (0-127) represent themselves; all characters
> >127 are represented by sequences of bytes with the top bit set.

I recently looked into this. Apologies if this is old news to people,
but I personally found it noteworthy how UTF-8 actually encodes Unicode
characters. I looked at this:

http://en.wikipedia.org/wiki/UTF-8#Description

The thing is, anything above the old backwards-compatible ASCII range
(0-127) will most likely be listed in character tables as U+abcd, where
abcd is a hex number (usually four digits, but it can be up to six).
But as soon as you're above ASCII, the U+abcd number is not the hex you
will see if you look at the string with hexdump -C. That's because the
high bits of every byte in a multi-byte sequence are fixed by the
standard: they mark whether a byte starts a sequence (and how long that
sequence is) or continues one. Only the remaining bits carry the
character's number.

The table at the above Wikipedia link explains how you can get from the
U+abcd hex to the actual hex you'd see on disk and vice versa. (Pay
attention to the underlined parts vs. the non-underlined parts and
you'll quickly get it.)

Again, maybe you all know this, but I only learned this recently; maybe
this helps somebody; otherwise sorry for the noise.

regards,
--ropers
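
P.S. In case a worked example helps: below is a rough sketch of the
U+abcd-to-bytes conversion, done by hand with shifts and masks. It
assumes Python 3; the function name utf8_bytes is made up for this
post, and it does no validation (it won't reject surrogates or values
above U+10FFFF), so treat it as an illustration of the bit layout, not
a real encoder.

```python
def utf8_bytes(cp):
    """Encode one code point by hand (illustrative helper, no validation)."""
    if cp < 0x80:
        # 1 byte:  0xxxxxxx  (plain ASCII, stored as-is)
        return bytes([cp])
    if cp < 0x800:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# U+20AC (euro sign) is listed as 20AC but stored as e2 82 ac
print(utf8_bytes(0x20AC).hex())  # e282ac
# U+00E9 (e with acute accent) is listed as 00E9 but stored as c3 a9
print(utf8_bytes(0x00E9).hex())  # c3a9
```

The second and later bytes always start with the bits 10, which is why
every byte of a non-ASCII character has the top bit set, as Christian
wrote.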