Re: Spelling support doesn´t deal with `´´ correctly

Tony Mechelynck Tue, 15 Mar 2011 23:37:21 -0700

On 15/03/11 15:49, Gary Johnson wrote:

On 2011-03-11, Nikolai Weibull wrote:

But this is a big "whatever".  As latin1 (or, more appropriately,
iso-8859-1) is a superset of ASCII and Unicode is a superset of
latin1, then what I really care about is having support for Unicode
quotes.


Latin1 is a superset of ASCII, but Unicode is not a superset of
latin1.  Unicode supports a larger set of characters than latin1 and
shares some character encodings in common with latin1 but it is a
different encoding.

Regards,
Gary

Unicode is a superset of Latin1 in the sense that every Latin1 character
is also a Unicode codepoint, and at the same ordinal position (the first
256 Unicode codepoints are the 256 Latin1 characters in the same order).
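
The ordinal claim is easy to check outside Vim. The following is only a sketch in Python (the variable names are made up for illustration): it decodes all 256 possible Latin1 byte values and compares each resulting character's codepoint with the byte value it came from.

```python
# All 256 byte values, interpreted as Latin1 text.
latin1_bytes = bytes(range(256))
decoded = latin1_bytes.decode("latin-1")

# Each Latin1 byte value should map to the Unicode codepoint
# with the same number.
same_ordinals = all(ord(ch) == i for i, ch in enumerate(decoded))
print(same_ordinals)
```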

However no Unicode encoding represents Latin1 characters higher than
0x7F *on disk* by the same binary value that Latin1 does (UTF-8, but not
the other Unicode encodings except maybe --I'm not sure-- GB18030,
represents the 128 US-ASCII characters the same way as both US-ASCII and
Latin1).
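
As a rough illustration of the on-disk difference, here is a Python sketch (Python's codecs, not Vim's own conversion code; the names are illustrative). It encodes one Latin1 character above 0x7F and one US-ASCII character in several encodings. With Python's codecs at least, GB18030 leaves US-ASCII bytes unchanged just as UTF-8 does.

```python
e_acute = "\u00e9"  # é, byte 0xE9 in Latin1
encodings = ["latin-1", "utf-8", "utf-16-be", "utf-16-le",
             "utf-32-be", "gb18030"]

# Bytes that would be written to disk for each encoding.
on_disk = {name: e_acute.encode(name) for name in encodings}
ascii_on_disk = {name: "A".encode(name) for name in encodings}

for name in encodings:
    print(name, on_disk[name].hex(), ascii_on_disk[name].hex())
```

Only the `latin-1` entry of `on_disk` is the single byte 0xE9; every Unicode encoding listed uses a different byte sequence for the same character.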


<encyclopedia>

The above paragraph implies that Unicode is not *one* encoding, even
though Vim represents all Unicode codepoints the same way *in memory*.
Rather, Unicode should be seen as a way of classifying all known writing
systems as a one-dimensional list going from zero to "something high" by
integer steps or "codepoints". These codepoints may be coded as bytes in
different ways:

* UTF-8, which uses one or more bytes per codepoint, and where the byte
  0x00 can only represent the codepoint U+0000 (the null codepoint) so
  it's useful for a representation using C strings. The first byte used
  for any codepoint tells how many bytes there will be in all, the other
  ones (if any) have values which cannot happen in the first byte, so
  synchronization is easy even if corrupt bytes become embedded in the
  text.
* UCS-2, which uses one two-byte word (big-endian or little-endian) per
  codepoint and cannot represent any codepoint higher than U+FFFF
* UTF-16, which extends UCS-2 up to U+10FFFF by means of "surrogate
  codepoints", using two words for codepoints higher than U+FFFF
* UCS-4 aka UTF-32, which can be big-endian or little-endian (or even,
  I've been told, ordered 2143 or 3412) and uses one four-byte doubleword
  per codepoint. It simply stores each codepoint as its ordinal value
  expressed as one unsigned 32-bit integer.
* GB18030, which is skewed in favour of Chinese; it allows
  representation of any Unicode codepoint but the conversion in either
  direction between it and other Unicode encodings requires bulky tables.
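
Two of the points in the list, the UTF-8 lead byte announcing the sequence length and the UTF-16 surrogate pairs, can be sketched in a few lines of Python. The helper names `utf8_sequence_length` and `utf16_surrogates` are hypothetical and exist only for this illustration; they are not taken from Vim or from any library.

```python
def utf8_sequence_length(lead_byte: int) -> int:
    """Total number of bytes announced by the first byte of a UTF-8 sequence.

    Only the bit pattern is inspected; lead bytes that strict UTF-8
    forbids (0xC0, 0xC1, 0xF5 to 0xF7) are not rejected here.
    """
    if lead_byte < 0x80:
        return 1
    if 0xC0 <= lead_byte <= 0xDF:
        return 2
    if 0xE0 <= lead_byte <= 0xEF:
        return 3
    if 0xF0 <= lead_byte <= 0xF7:
        return 4
    # 0x80 to 0xBF are continuation bytes and never start a sequence.
    raise ValueError("0x%02X cannot start a UTF-8 sequence" % lead_byte)


def utf16_surrogates(codepoint: int) -> tuple:
    """Split a codepoint above U+FFFF into its (high, low) surrogate words."""
    offset = codepoint - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


# Byte counts per codepoint in UTF-8, UTF-16 and UTF-32.
for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    print("U+%04X" % ord(ch),
          len(ch.encode("utf-8")),
          len(ch.encode("utf-16-be")),
          len(ch.encode("utf-32-be")))
```

Because continuation bytes always fall in the range 0x80 to 0xBF, a reader that lands in the middle of a sequence can skip forward to the next byte outside that range and resume decoding there.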

Conversion between any of the above except GB18030 is trivial; Vim does
it with no need for the iconv library. For UCS-2, UTF-16 and UTF-32,
when the endianness is omitted, big-endian is implied, even on
little-endian processors such as the Intel ones used in all Windows PCs,
most Linux ones, and many of those equipped with Mac OS X.
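
To give an idea of how little such a conversion needs, here is a sketch in Python of UTF-32 (big-endian) to UTF-8 using nothing but shifts and masks. This is not Vim's code; `encode_utf8` and `utf32be_to_utf8` are hypothetical names, and the sketch does no validation (it does not reject surrogates or values above U+10FFFF). Note that Python's own codec names behave differently from what is described above: its plain "utf-16" and "utf-32" codecs write a byte-order mark and use the machine's native order, so the example spells out "utf-32-be" explicitly.

```python
def encode_utf8(codepoint: int) -> bytes:
    """Encode one codepoint as UTF-8 with bit arithmetic only."""
    if codepoint < 0x80:
        return bytes([codepoint])
    if codepoint < 0x800:
        return bytes([0xC0 | (codepoint >> 6),
                      0x80 | (codepoint & 0x3F)])
    if codepoint < 0x10000:
        return bytes([0xE0 | (codepoint >> 12),
                      0x80 | ((codepoint >> 6) & 0x3F),
                      0x80 | (codepoint & 0x3F)])
    return bytes([0xF0 | (codepoint >> 18),
                  0x80 | ((codepoint >> 12) & 0x3F),
                  0x80 | ((codepoint >> 6) & 0x3F),
                  0x80 | (codepoint & 0x3F)])


def utf32be_to_utf8(data: bytes) -> bytes:
    """Convert big-endian UTF-32 bytes to UTF-8, four bytes at a time."""
    out = bytearray()
    for i in range(0, len(data), 4):
        codepoint = int.from_bytes(data[i:i + 4], "big")
        out += encode_utf8(codepoint)
    return bytes(out)


sample = "A\u00e9\u20ac\U0001F600"
converted = utf32be_to_utf8(sample.encode("utf-32-be"))
print(converted == sample.encode("utf-8"))
```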

</encyclopedia>


Best regards,
Tony.
--
Champagne don't make me lazy.
Cocaine don't drive me crazy.
Ain't nobody's business but my own.
                -- Taj Mahal

--
You received this message from the "vim_dev" maillist.
Do not top-post! Type your reply below the text you are replying to.
For more information, visit http://www.vim.org/maillist.php
