Christian Ebert wrote:
Hello,

Is it possible to have eg. iso-8859-1 encoded words/passages in
an otherwise utf-8 encoded file? I mean, w/o automatic
conversion, and I don't need the iso passages displayed in a
readable way, but so I can still write the file in utf-8 w/o
changing the "invalid" iso-8859-1 chars?

Hm, hope I made myself clear.

TIA

c

"Valid" UTF-8 byte sequences are as follows:

Codepoints U+0000 to U+007F are represented as one byte each, 0x00 to 0x7F

Codepoints U+0080 to U+07FF are represented as two bytes each, in binary: 110x.xxxx 10xx.xxxx where x represents the bits of the codepoint number.

Codepoints U+0800 to U+FFFF are represented as three bytes each, in binary: 1110.xxxx 10xx.xxxx 10xx.xxxx

Codepoints U+10000 to U+10FFFF are represented as four bytes each, in binary:
1111.0xxx 10xx.xxxx 10xx.xxxx 10xx.xxxx

In the current version of the Unicode Standard, codepoints U+110000 to U+7FFFFFFF, which were initially foreseen, have been declared invalid; Vim, however, still accepts them. They are represented by four to six bytes each, extending the above scheme up to U+7FFFFFFF = FD BF BF BF BF BF.

All other byte sequences are invalid. The 'fileencodings' algorithm for detecting UTF-8 files should (properly) declare that any file containing byte sequences other than the above is not valid UTF-8.
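As a quick illustration (a Python sketch of mine, not part of the original question or of Vim), the built-in codec produces exactly the one- to four-byte patterns listed above:

```python
# Illustrative only: show the UTF-8 byte patterns described above
# using Python's built-in codec.
for ch in ("A", "\u00E9", "\u20AC", "\U0001F600"):
    print(f"U+{ord(ch):04X} -> {ch.encode('utf-8').hex(' ').upper()}")
# U+0041 -> 41
# U+00E9 -> C3 A9
# U+20AC -> E2 82 AC
# U+1F600 -> F0 9F 98 80
```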

Conclusion: Latin1 text strings which include only 7-bit US-ASCII data are represented identically in US-ASCII, Latin1 and UTF-8 and may therefore be included into UTF-8 files without any conversion. Latin1 characters higher than 0x7F must be translated to two-byte UTF-8 byte sequences, otherwise the UTF-8 file will become invalid.
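To see that conversion concretely (a hedged Python sketch; the word "café" is just a stand-in string of my own):

```python
# A Latin1 byte above 0x7F (here 0xE9, "é") must become a two-byte
# sequence before it can legally appear in a UTF-8 file.
latin1_bytes = "caf\u00E9".encode("latin-1")        # b'caf\xe9'
utf8_bytes = latin1_bytes.decode("latin-1").encode("utf-8")
print(utf8_bytes)                                   # b'caf\xc3\xa9'
```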

Corollary of the conclusion:

#1.
cat file1.utf8.txt file2.latin1.txt file3.utf8.txt > file99.utf8.txt

will produce invalid output unless the Latin1 input file is actually pure 7-bit US-ASCII. This is not a limitation of the "cat" program (which by design never translates anything) but a false manoeuvre on the part of the user.
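The byte-level fix (sketched here in Python, with in-memory stand-ins of my own invention for the three files of the example) is to convert the Latin1 part before concatenating:

```python
# In-memory stand-ins for the three files of the "cat" example above.
part1 = "first part\n".encode("utf-8")     # file1.utf8.txt
part2 = "caf\u00E9\n".encode("latin-1")    # file2.latin1.txt, contains 0xE9
part3 = "third part\n".encode("utf-8")     # file3.utf8.txt

raw_cat = part1 + part2 + part3            # what "cat" produces: invalid UTF-8
good = part1 + part2.decode("latin-1").encode("utf-8") + part3
good.decode("utf-8")                       # decodes cleanly; raw_cat would not
```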

#2.
gvim
        :if &tenc == "" | let &tenc = &enc | endif
        :set enc=utf-8 fencs=ucs-bom,utf-8,latin1
        :e ++enc=utf-8 file1.utf8.txt
        :$r ++enc=latin1 file2.latin1.txt
        :$r ++enc=utf-8 file3.utf8.txt
        :saveas file99.utf8.txt

will produce valid output in all cases (assuming your Vim executable has the +multi_byte feature compiled in), because Vim does the necessary conversion automatically. In most cases the ++enc=<encoding> options are not even necessary, because the 'fileencodings' heuristics usually detect each file's charset correctly.

If you have an invalid file (such as the one produced by the "cat" command above), you can, in Vim 7 only, use the 8g8 command (q.v.), with 'encoding' and/or 'fileencoding' set to an 8-bit encoding, to locate invalid UTF-8 bytes.
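For completeness, here is a rough Python analogue of such a scan (my own sketch, an assumption about the idea rather than Vim's actual implementation of 8g8): report the byte offsets at which strict UTF-8 decoding fails.

```python
def invalid_utf8_offsets(data: bytes) -> list:
    """Return the byte offsets where strict UTF-8 decoding fails."""
    offsets, pos = [], 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break                          # the rest decodes cleanly
        except UnicodeDecodeError as e:
            offsets.append(pos + e.start)  # first offending byte
            pos += e.start + 1             # resume scanning after it
    return offsets

print(invalid_utf8_offsets(b"ok \xe9 caf\xe9"))           # [3, 8]
print(invalid_utf8_offsets("caf\u00E9".encode("utf-8")))  # []
```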


Best regards,
Tony.
