On 15/03/11 15:49, Gary Johnson wrote:
> On 2011-03-11, Nikolai Weibull wrote:
>> But this is a big "whatever". As latin1 (or, more appropriately,
>> iso-8859-1) is a superset of ASCII and Unicode is a superset of
>> latin1, then what I really care about is having support for Unicode
>> quotes.
>
> Latin1 is a superset of ASCII, but Unicode is not a superset of
> latin1. Unicode supports a larger set of characters than latin1 and
> shares some character encodings in common with latin1 but it is a
> different encoding.
>
> Regards,
> Gary

Unicode is a superset of Latin1 in the sense that every Latin1 character
is also a Unicode codepoint, and at the same ordinal position (the first
256 Unicode codepoints are the 256 Latin1 characters in the same order).
However, no Unicode encoding represents Latin1 characters higher than
0x7F *on disk* by the same single byte that Latin1 does. UTF-8 and
GB18030, but not the other Unicode encodings, represent the 128 US-ASCII
characters the same way as both US-ASCII and Latin1.
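
To make the ordinal-versus-bytes distinction concrete, here is a small
sketch in Python (my choice of language for illustration only; the helper
name `show` is made up and has nothing to do with Vim). It compares the
Latin1 byte of a character with its Unicode codepoint and its UTF-8 bytes.

```python
def show(ch):
    """Return (codepoint, Latin1 bytes, UTF-8 bytes) for one character."""
    codepoint = ord(ch)
    latin1 = ch.encode("latin-1")
    utf8 = ch.encode("utf-8")
    return codepoint, latin1, utf8

# e-acute is byte 0xE9 in Latin1 and codepoint U+00E9 in Unicode:
# same ordinal, but UTF-8 needs two bytes for it.
cp, l1, u8 = show("\u00e9")
print(hex(cp), l1.hex(), u8.hex())   # 0xe9 e9 c3a9

# 'A' is below 0x80, so Latin1 and UTF-8 agree on the single byte.
cp, l1, u8 = show("A")
print(hex(cp), l1.hex(), u8.hex())   # 0x41 41 41
```

So the "superset" claim holds for the numbering of characters, while the
"different encoding" claim holds for the bytes that end up in the file.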
<encyclopedia>
The above paragraph implies that Unicode is not *one* encoding, even
though Vim represents all Unicode codepoints the same way *in memory*.
Rather, Unicode should be seen as a way of numbering the characters of
all known writing systems in a one-dimensional list going from zero to
U+10FFFF by integer steps or "codepoints". These codepoints may be coded
as bytes in different ways:
* UTF-8, which uses one to four bytes per codepoint, and where the byte
0x00 can only represent the codepoint U+0000 (the null codepoint), so
it's useful for a representation using C strings. The first byte used
for any codepoint tells how many bytes there will be in all; the other
ones (if any) have values which cannot happen in the first byte, so
resynchronization is easy even if corrupt bytes become embedded in the text.
* UCS-2, which uses one two-byte word (big-endian or little-endian) per
codepoint and cannot represent any codepoint higher than U+FFFF.
* UTF-16, which extends UCS-2 up to U+10FFFF by means of "surrogate
codepoints", using two words for codepoints higher than U+FFFF.
* UCS-4 aka UTF-32, which can be big-endian or little-endian (or even,
I've been told, ordered 2143 or 3412) and uses one four-byte doubleword
per codepoint. It simply stores each codepoint as its ordinal value
expressed as one unsigned 32-bit integer.
* GB18030, which is skewed in favour of Chinese; it allows
representation of any Unicode codepoint but the conversion in either
direction between it and other Unicode encodings requires bulky tables.
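
As a rough illustration of the list above, the following Python sketch
encodes three sample codepoints with each encoding. The codec names are
Python's, not Vim's 'fileencoding' values, and the helper names
`encode_all` and `utf8_sequence_length` are invented for this example.
Python has no separate UCS-2 codec; for codepoints up to U+FFFF its
UTF-16 output is what UCS-2 would give.

```python
# Python codec names (assumption: these stand in for the encodings listed).
CODECS = ["utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le", "gb18030"]

def encode_all(ch):
    """Return {codec name: hex string of the encoded bytes} for one character."""
    return {name: ch.encode(name).hex() for name in CODECS}

def utf8_sequence_length(byte):
    """Length announced by a UTF-8 lead byte; 0 for a continuation or invalid byte."""
    if byte < 0x80:
        return 1
    if 0xC2 <= byte <= 0xDF:
        return 2
    if 0xE0 <= byte <= 0xEF:
        return 3
    if 0xF0 <= byte <= 0xF4:
        return 4
    return 0

# U+0041 (ASCII), U+20AC (euro sign), U+1D11E (above U+FFFF, needs surrogates).
for ch in ("A", "\u20ac", "\U0001d11e"):
    print("U+%04X" % ord(ch), encode_all(ch))

# Lead bytes announce the sequence length; continuation bytes report 0,
# which is what lets a reader find the next character boundary.
data = "A\u20ac\U0001d11e".encode("utf-8")
print([utf8_sequence_length(b) for b in data])   # [1, 3, 0, 0, 4, 0, 0, 0]
```

Note how U+1D11E takes two 16-bit words in UTF-16 (d834 dd1e big-endian)
but a single 32-bit value in UTF-32.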
Conversion between any of the above except GB18030 is trivial; Vim does
it with no need for the iconv library. For UCS-2, UTF-16 and UTF-32,
when the endianness is omitted, big-endian is implied, even on
little-endian processors such as the Intel ones used in all Windows PCs,
most Linux ones, and many of those equipped with Mac OS X.
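
To show why these conversions need no tables, here is a minimal Python
sketch that turns a codepoint number into UTF-8, big-endian UTF-16 and
big-endian UTF-32 with nothing but shifts and masks. This is not Vim's
code; the function names are hypothetical, and the sketch assumes a valid
codepoint in the range 0..0x10FFFF that is not itself a surrogate.

```python
def to_utf8(cp):
    """Encode one codepoint as UTF-8 (1 to 4 bytes)."""
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

def to_utf16_be(cp):
    """Encode one codepoint as big-endian UTF-16 (one or two 16-bit words)."""
    if cp < 0x10000:
        return cp.to_bytes(2, "big")
    v = cp - 0x10000                 # 20 bits left to split in two halves
    high = 0xD800 | (v >> 10)        # high surrogate carries the top 10 bits
    low = 0xDC00 | (v & 0x3FF)       # low surrogate carries the bottom 10 bits
    return high.to_bytes(2, "big") + low.to_bytes(2, "big")

def to_utf32_be(cp):
    """Encode one codepoint as big-endian UTF-32: just its ordinal value."""
    return cp.to_bytes(4, "big")

# Cross-check against Python's own codecs for a few sample codepoints.
for cp in (0x41, 0xE9, 0x20AC, 0x1D11E):
    ch = chr(cp)
    assert to_utf8(cp) == ch.encode("utf-8")
    assert to_utf16_be(cp) == ch.encode("utf-16-be")
    assert to_utf32_be(cp) == ch.encode("utf-32-be")
```

GB18030 is the odd one out: most of its two-byte range cannot be computed
this way and has to be looked up.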
</encyclopedia>
Best regards,
Tony.
--
Champagne don't make me lazy.
Cocaine don't drive me crazy.
Ain't nobody's business but my own.
-- Taj Mahal
--
You received this message from the "vim_dev" maillist.
Do not top-post! Type your reply below the text you are replying to.
For more information, visit http://www.vim.org/maillist.php