On 25 Mar 2009, at 17:55, Francisco Vila wrote:
I am now confused because Trevor has said that the hex value is a
variable length coding value for the Unicode entity, therefore this
hex number has to follow the utf-8 rules, not utf-32 which is always a
32bit fixed-length value.
...
... after Trevor I now think the hex value _is_ utf-8
coded. I might be completely wrong.
You might search this page for "code point":
http://en.wikipedia.org/wiki/Unicode
It is just a natural number assigned to each abstract character that
Unicode defines.
The section
http://en.wikipedia.org/wiki/Unicode#Architecture_and_terminology
describes the convention of writing these numbers with the prefix
"U+": numbers below 2^16 are written with four hex digits, and others
with five or six as needed.
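As a small illustration of that notation, here is a Python sketch. The
helper name u_plus is made up for this example; it only relies on the
built-in ord(), which returns the code point as a plain integer.

```python
def u_plus(ch: str) -> str:
    """Return the conventional U+ notation for a single character."""
    # Pad to at least four hex digits; larger code points simply
    # use five or six digits.
    return f"U+{ord(ch):04X}"

print(u_plus("A"))           # U+0041
print(u_plus("\u20ac"))      # U+20AC  (euro sign)
print(u_plus("\U0001D11E"))  # U+1D11E (musical symbol G clef)
```

Note that nothing here says how the character is stored in bytes; the
number is the same whichever encoding is used later.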
Then, in order to get it into a computer, one uses an encoding that
translates these numbers into byte sequences. Among these are UTF-8,
UTF-16 and UTF-32. The last, UTF-32, ought to be the simplest, because
it just stores the code point as a 32-bit binary number, but since
computers do not agree on the order of bytes within such a number,
there are two: UTF-32BE (big endian, used by PowerPC) and UTF-32LE
(little endian, used by Intel PCs). Similarly for UTF-16, which was
invented in the days when one thought 16 bits would be enough for all
of Unicode, but was later extended in an irregular way (surrogate
pairs) to reach code points from 2^16 upward.
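To make the byte-order point concrete, here is a minimal Python sketch
that encodes one code point above 2^16 in the UTF-32 and UTF-16
variants. The choice of U+1D11E as the sample character and the
variable names are mine; the codec names are the standard Python ones.

```python
ch = "\U0001D11E"  # code point U+1D11E, does not fit in 16 bits

# UTF-32: the code point itself as a 32-bit number, in either byte order.
utf32_be = ch.encode("utf-32-be").hex()
utf32_le = ch.encode("utf-32-le").hex()
print(utf32_be)  # 0001d11e
print(utf32_le)  # 1ed10100

# UTF-16: two 16-bit units (a surrogate pair), again in either byte order.
utf16_be = ch.encode("utf-16-be").hex()
utf16_le = ch.encode("utf-16-le").hex()
print(utf16_be)  # d834dd1e
print(utf16_le)  # 34d81edd
```

In the big-endian UTF-32 output the hex digits of the code point can be
read off directly; in UTF-16 they cannot, which is the irregularity
mentioned above.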
UTF-8 does not have this endianness problem, since today one mostly
agrees on how to order the bits in a byte. It was invented for use on
UNIX computers. It is constructed so that bytes with the highest bit 0
have the same value as in ASCII, and all other characters are
multibyte, with the highest bit of every byte set to 1. It was later
adopted by Unicode, which imposed a limit on the number of characters:
the original design could encode values of up to 31 bits in up to six
bytes, while Unicode stops at U+10FFFF, which needs at most four. So
strictly speaking, there are two UTF-8.
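The following Python sketch shows the difference between the code point
and its UTF-8 bytes, which is the distinction the original question
turns on. The helper name show_utf8 and the sample characters are my
own choice for illustration.

```python
def show_utf8(ch: str) -> str:
    """Show a character's code point next to its UTF-8 bytes and bits."""
    data = ch.encode("utf-8")
    bits = " ".join(f"{b:08b}" for b in data)
    return f"U+{ord(ch):04X} -> {data.hex()} ({bits})"

for ch in ["A", "\u00e9", "\u20ac", "\U0001D11E"]:
    print(show_utf8(ch))

# U+0041 -> 41 (01000001)
# U+00E9 -> c3a9 (11000011 10101001)
# U+20AC -> e282ac (11100010 10000010 10101100)
# U+1D11E -> f09d849e (11110000 10011101 10000100 10011110)
```

Only for the ASCII character does the hex of the code point coincide
with the UTF-8 byte; in every multibyte sequence each byte starts with
a 1 bit.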
Hans Aberg