Christian Ebert wrote:
Hello,

Is it possible to have eg. iso-8859-1 encoded words/passages in
an otherwise utf-8 encoded file? I mean, w/o automatic
conversion, and I don't need the iso passages displayed in a
readable way, but so I can still write the file in utf-8 w/o
changing the "invalid" iso-8859-1 chars?

Hm, hope I made myself clear.

TIA

c

"Valid" UTF-8 byte sequences are as follows:

Codepoints U+0000 to U+007F are represented as one byte each, 0x00 to 0x7F

Codepoints U+0080 to U+07FF are represented as two bytes each, in binary: 110x.xxxx 10xx.xxxx where x represents the bits of the codepoint number.

Codepoints U+0800 to U+FFFF are represented as three bytes each, in binary: 1110.xxxx 10xx.xxxx 10xx.xxxx

Codepoints U+10000 to U+10FFFF are represented as four bytes each, in binary:
1111.0xxx 10xx.xxxx 10xx.xxxx 10xx.xxxx

In the current version of the Unicode Standard, codepoints U+110000 to U+7FFFFFFF, which were initially foreseen, have been declared invalid; Vim, however, still accepts them. They are represented by four to six bytes each, extending the above scheme up to U+7FFFFFFF = FD BF BF BF BF BF.

All other byte sequences are invalid. The 'fileencodings' algorithm for detecting UTF-8 files should (properly) declare that any file containing byte sequences other than the above is not valid UTF-8.
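As a quick illustration (a Python sketch of mine, not part of the original question or of Vim), the built-in codec produces exactly the one- to four-byte patterns listed above:

```python
# Illustrative only: show the UTF-8 byte patterns described above
# using Python's built-in codec.
for ch in ("A", "\u00E9", "\u20AC", "\U0001F600"):
    print(f"U+{ord(ch):04X} -> {ch.encode('utf-8').hex(' ').upper()}")
# U+0041 -> 41
# U+00E9 -> C3 A9
# U+20AC -> E2 82 AC
# U+1F600 -> F0 9F 98 80
```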

Conclusion: Latin1 text strings which include only 7-bit US-ASCII data are represented identically in US-ASCII, Latin1 and UTF-8 and may therefore be included into UTF-8 files without any conversion. Latin1 characters higher than 0x7F must be translated to two-byte UTF-8 byte sequences, otherwise the UTF-8 file will become invalid.
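To see that conversion concretely (a hedged Python sketch; the word "café" is just a stand-in string of my own):

```python
# A Latin1 byte above 0x7F (here 0xE9, "é") must become a two-byte
# sequence before it can legally appear in a UTF-8 file.
latin1_bytes = "caf\u00E9".encode("latin-1")        # b'caf\xe9'
utf8_bytes = latin1_bytes.decode("latin-1").encode("utf-8")
print(utf8_bytes)                                   # b'caf\xc3\xa9'
```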

Corollary of the conclusion:

#1.
cat file1.utf8.txt file2.latin1.txt file3.utf8.txt > file99.utf8.txt

will produce invalid output unless the Latin1 input file is actually pure 7-bit US-ASCII. This is not a limitation of the "cat" program (which by design never translates anything) but a false manoeuvre on the part of the user.
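The byte-level fix (sketched here in Python, with in-memory stand-ins of my own invention for the three files of the example) is to convert the Latin1 part before concatenating:

```python
# In-memory stand-ins for the three files of the "cat" example above.
part1 = "first part\n".encode("utf-8")     # file1.utf8.txt
part2 = "caf\u00E9\n".encode("latin-1")    # file2.latin1.txt, contains 0xE9
part3 = "third part\n".encode("utf-8")     # file3.utf8.txt

raw_cat = part1 + part2 + part3            # what "cat" produces: invalid UTF-8
good = part1 + part2.decode("latin-1").encode("utf-8") + part3
good.decode("utf-8")                       # decodes cleanly; raw_cat would not
```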

#2.
gvim
        :if &tenc == "" | let &tenc = &enc | endif
        :set enc=utf-8 fencs=ucs-bom,utf-8,latin1
        :e ++enc=utf-8 file1.utf8.txt
        :$r ++enc=latin1 file2.latin1.txt
        :$r ++enc=utf-8 file3.utf8.txt
        :saveas file99.utf8.txt

will produce valid output in all cases (assuming your Vim executable has the +multi_byte feature compiled in), because Vim does the necessary conversion automatically. In most cases the ++enc=<encoding> options are not even necessary, because the 'fileencodings' heuristics usually detect each file's charset correctly.

If you have an invalid file (such as the one produced by the "cat" command above), you can, in Vim 7 only, use the 8g8 command (q.v.), with 'encoding' and/or 'fileencoding' set to an 8-bit encoding, to locate invalid UTF-8 bytes.
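For completeness, here is a rough Python analogue of such a scan (my own sketch, an assumption about the idea rather than Vim's actual implementation of 8g8): report the byte offsets at which strict UTF-8 decoding fails.

```python
def invalid_utf8_offsets(data: bytes) -> list:
    """Return the byte offsets where strict UTF-8 decoding fails."""
    offsets, pos = [], 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break                          # the rest decodes cleanly
        except UnicodeDecodeError as e:
            offsets.append(pos + e.start)  # first offending byte
            pos += e.start + 1             # resume scanning after it
    return offsets

print(invalid_utf8_offsets(b"ok \xe9 caf\xe9"))           # [3, 8]
print(invalid_utf8_offsets("caf\u00E9".encode("utf-8")))  # []
```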


Best regards,
Tony.
