On 15/03/11 15:49, Gary Johnson wrote:
> On 2011-03-11, Nikolai Weibull wrote:
>
>> But this is a big "whatever".  As latin1 (or, more appropriately,
>> iso-8859-1) is a superset of ASCII and Unicode is a superset of
>> latin1, then what I really care about is having support for Unicode
>> quotes.
>
> Latin1 is a superset of ASCII, but Unicode is not a superset of
> latin1.  Unicode supports a larger set of characters than latin1 and
> shares some character encodings in common with latin1 but it is a
> different encoding.
>
> Regards,
> Gary


Unicode is a superset of Latin1 in the sense that every Latin1 character is also a Unicode codepoint, and at the same ordinal position (the first 256 Unicode codepoints are the 256 Latin1 characters in the same order).
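
To make the ordinal claim concrete, here is a small Python check (my own illustration, not anything Vim does): it decodes all 256 possible Latin1 byte values and compares each resulting codepoint with the byte value it came from. The variable names are invented for the example.

```python
# Decode every possible Latin1 byte and compare ordinals.
latin1_bytes = bytes(range(256))
decoded = latin1_bytes.decode("latin-1")

# True when byte value N always decodes to codepoint U+00NN.
same_ordinals = all(ord(ch) == i for i, ch in enumerate(decoded))
print(same_ordinals)
```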

However, no Unicode encoding represents Latin1 characters above 0x7F *on disk* with the same byte values that Latin1 does. (UTF-8 and GB18030, unlike the other Unicode encodings, do represent the 128 US-ASCII characters the same way as both US-ASCII and Latin1.)
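
A quick way to see the on-disk difference, sketched in Python with its standard codecs (the choice of "é", U+00E9, as the sample character is mine): the same codepoint is one byte in Latin1 but a different byte sequence in each Unicode encoding, while a plain ASCII letter stays the same single byte in UTF-8.

```python
ch = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

latin1 = ch.encode("latin-1")      # one byte, equal to the codepoint
utf8 = ch.encode("utf-8")          # two bytes
utf16be = ch.encode("utf-16-be")   # one 16-bit word, big-endian
gb18030 = ch.encode("gb18030")     # multi-byte, table-driven

for name, data in [("latin-1", latin1), ("utf-8", utf8),
                   ("utf-16-be", utf16be), ("gb18030", gb18030)]:
    print(f"{name:10} {data.hex()}")

# ASCII characters, by contrast, keep their single byte in UTF-8.
ascii_same = "A".encode("utf-8") == "A".encode("latin-1") == b"A"
print(ascii_same)
```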

<encyclopedia>
The above paragraph implies that Unicode is not *one* encoding, even though Vim represents all Unicode codepoints the same way *in memory*. Rather, Unicode should be seen as a way of classifying all known writing systems as a one-dimensional list going from zero to "something high" by integer steps or "codepoints". These codepoints may be coded as bytes in different ways:

* UTF-8, which uses one or more bytes per codepoint, and where the byte 0x00 can only represent the codepoint U+0000 (the null codepoint) so it's useful for a representation using C strings. The first byte used for any codepoint tells how many bytes there will be in all, the other ones (if any) have values which cannot happen in the first byte, so synchronization is easy even if corrupt bytes become embedded in the text.
* UCS-2, which uses one two-byte word (big-endian or little-endian) per codepoint and cannot represent any codepoint higher than U+FFFF.
* UTF-16, which extends UCS-2 up to U+10FFFF by means of "surrogate codepoints", using two words for codepoints higher than U+FFFF.
* UCS-4 aka UTF-32, which can be big-endian or little-endian (or even, I've been told, ordered 2143 or 3412) and uses one four-byte doubleword per codepoint. It simply stores each codepoint as its ordinal value expressed as one unsigned 32-bit integer.
* GB18030, which is skewed in favour of Chinese; it allows representation of any Unicode codepoint but the conversion in either direction between it and other Unicode encodings requires bulky tables.
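
As an illustration of the list above, the following Python sketch prints how one codepoint comes out in several of these encodings. The helper name `encodings` and the two sample codepoints (U+20AC, the euro sign, and U+1D11E, the musical G clef) are my own choices; the codec names are the ones Python's standard library uses, not Vim's.

```python
def encodings(codepoint):
    """Return a dict mapping codec name to the hex bytes of one codepoint."""
    ch = chr(codepoint)
    names = ("utf-8", "utf-16-be", "utf-16-le",
             "utf-32-be", "utf-32-le", "gb18030")
    return {name: ch.encode(name).hex() for name in names}

for cp in (0x20AC, 0x1D11E):
    print(f"U+{cp:04X}")
    for name, hexbytes in encodings(cp).items():
        print(f"  {name:10} {hexbytes}")
```

U+1D11E lies above U+FFFF, so its UTF-16 forms take two words (a surrogate pair) where U+20AC takes one.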

Conversion between any of the above except GB18030 is trivial; Vim does it with no need for the iconv library. For UCS-2, UTF-16 and UTF-32, when the endianness is omitted, big-endian is implied, even on little-endian processors such as the Intel ones used in all Windows PCs, most Linux ones, and many of those equipped with Mac OS X.
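
To show why such conversion needs no tables, here is a rough Python sketch of encoding a single codepoint to UTF-8 and to big-endian UTF-16 using only shifts and masks. This is not Vim's code; the function names are hypothetical, and the sketch assumes a valid codepoint in the range 0 to 0x10FFFF that is not itself a surrogate.

```python
def utf8_encode(cp):
    """Encode one codepoint as UTF-8 (assumes 0 <= cp <= 0x10FFFF)."""
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

def utf16be_encode(cp):
    """Encode one codepoint as big-endian UTF-16."""
    if cp < 0x10000:
        return cp.to_bytes(2, "big")
    v = cp - 0x10000                  # 20 bits to split over two words
    high = 0xD800 | (v >> 10)         # high surrogate: top 10 bits
    low = 0xDC00 | (v & 0x3FF)        # low surrogate: bottom 10 bits
    return high.to_bytes(2, "big") + low.to_bytes(2, "big")

for cp in (0x41, 0xE9, 0x20AC, 0x1D11E):
    print(f"U+{cp:04X} {utf8_encode(cp).hex()} {utf16be_encode(cp).hex()}")
```

UTF-32 is simpler still: `cp.to_bytes(4, "big")` gives the big-endian form directly.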
</encyclopedia>


Best regards,
Tony.
--
Champagne don't make me lazy.
Cocaine don't drive me crazy.
Ain't nobody's business but my own.
                -- Taj Mahal

--
You received this message from the "vim_dev" maillist.
Do not top-post! Type your reply below the text you are replying to.
For more information, visit http://www.vim.org/maillist.php
