2009/5/13 Christian Weisgerber <na...@mips.inka.de>:
>
> Being able to use it with a Unix-style filesystem is one of UTF-8's
> design principles.
>
> All ASCII characters (0-127) represent themselves; all characters
> >127 are represented by sequences of bytes with the top bit set.

I recently looked into this. Apologies if this is old news to people,
but I personally found it noteworthy how Unicode characters are
actually encoded in UTF-8. I looked at this:

http://en.wikipedia.org/wiki/UTF-8#Description

The thing is, any character above the old backwards-compatible ASCII
range (0-127) will most likely be listed in character tables as U+abcd,
where abcd is a hex number (usually four digits, but it can be up to
six). For those characters, the U+abcd hex number is not the hex you
will see if you look at the string with hexdump -C. That's because
UTF-8 spreads the bits of the number over several bytes, and the
leading bits of each of those bytes are fixed by the standard. The
table at the above Wikipedia link explains how you can get from the
U+abcd hex to the actual bytes you'd see on disk and vice versa. (Pay
attention to the underlined parts vs. the non-underlined parts and
you'll quickly get it.)
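
In case it helps, here is a rough Python sketch of the conversion that
table describes, from the U+abcd number to the bytes on disk. The
function name utf8_encode is my own invention for illustration, and
this is only a minimal sketch: it assumes a valid code point and does
not reject surrogates (U+D800-U+DFFF) or values above U+10FFFF.

```python
def utf8_encode(cp):
    """Hand-encode one Unicode code point (an int) as UTF-8 bytes.

    Illustrative helper only: no checks for surrogates or for
    values above 0x10FFFF.
    """
    if cp < 0x80:
        # 0xxxxxxx: plain ASCII, the byte is the code point itself
        return bytes([cp])
    if cp < 0x800:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def hexbytes(data):
    """Format bytes the way hexdump -C shows them, e.g. 'e2 82 ac'."""
    return " ".join(f"{b:02x}" for b in data)


if __name__ == "__main__":
    # U+0041 'A', U+00E9 'e' with acute accent, U+20AC euro sign
    for cp in (0x41, 0xE9, 0x20AC):
        print(f"U+{cp:04X} -> {hexbytes(utf8_encode(cp))}")
```

For the euro sign, U+20AC, this gives e2 82 ac, which is what hexdump -C
shows, not 20 ac. Python's built-in "\u20ac".encode("utf-8") returns the
same bytes, so that is a quick way to double-check the hand-rolled
version.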

Again, maybe you all know this, but I only learned this recently;
maybe this helps somebody; otherwise sorry for the noise.

regards,
--ropers
