Re: Encoding and Fileencoding of a latin1 file

Tony Mechelynck Sun, 06 Jul 2014 06:09:16 -0700

See also http://vim.wikia.com/wiki/Working_with_Unicode

You seem to have read that article, which I wrote myself, so I'll try to explain in more detail (I hope not in boring detail) the logic behind it. Be sure to check the Vim help for anything which would still be unclear.

'encoding' is a global option determining how Vim represents characters in memory. The right place to set it is in your vimrc, BEFORE loading any editfile. Once you have started opening a file, changing 'encoding' makes the contents of ALL your current editfiles invalid, because it is not possible to convert all the contents of all your loaded buffers from one encoding to another as a result of your changing that option.

The :scriptencoding ex-command (not mentioned in that wiki page) tells Vim to override 'encoding' for the purpose of reading the current script. For instance, if your vimrc is encoded in Windows-1252 you can use

        scriptencoding Windows-1252

and any bytes between 0x80 and 0xFF in your script will be interpreted as in Windows-1252 even after you set 'encoding' to UTF-8.
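
As a rough sketch of how the two settings fit together near the top of a vimrc: the variable name below is made up for illustration, and the script file itself is assumed to be saved on disk as Windows-1252.

```vim
" Set 'encoding' first, before anything that could load a file.
set encoding=utf-8

" Declare how the following lines of this script are encoded on disk.
scriptencoding Windows-1252

" Hypothetical example: in the script file the e-acute is the single byte 0xE9.
let s:sample = 'café'

" With no argument, :scriptencoding stops converting the lines that follow.
scriptencoding
```

Placing :scriptencoding after the 'encoding' assignment matches the order described above: the script's bytes are converted to whatever 'encoding' is at that point.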

'fileencoding' (singular) is a local option. It says how the file in question will be represented on disk. If 'encoding' is UTF-8 (recommended) and if your Vim can use iconv (i.e., has('iconv') returns 1, meaning you either have +iconv linked in statically, or +iconv/dyn compiled in dynamically and the iconv or libiconv library found at runtime), then any encoding can be translated to and from UTF-8, and Vim can do just that when reading and writing. But note that if 'encoding' is set to UTF-8, and you modify a file to put in it characters not acceptable for that file's 'fileencoding', Vim will give you no error signal as long as you don't save the file; so you can change the 'fileencoding' before or after you change the file contents: as long as they agree when you write the file it's OK.
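
To see where you stand, the following commands can be typed one at a time at the Vim command line. This is only a quick check, not a full diagnosis:

```vim
" 1 if iconv is usable (linked in, or the dynamic library was found), else 0
:echo has('iconv')
" How Vim represents text in memory (global)
:set encoding?
" How the current buffer's file is represented on disk (local to the buffer)
:setlocal fileencoding?
```

An empty 'fileencoding' means the file is read and written in the same encoding as 'encoding', with no conversion.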

If the file contains only bytes less than 0x80, it will be interpreted identically in any of the following encodings (where those I'm writing on one line are synonyms, equivalent for Vim with iconv), and in a number of others:

- us-ascii
- latin1, iso-8859-1
- cp1252, Windows-1252
- latin9, iso-8859-15
- utf-8

so don't be afraid if Vim detects one of your Latin1 files (with no accented characters, French guillemets, etc.) as being UTF-8. In fact, with those contents, it could just as well be any of the encodings mentioned above (or a number of others). If you want to be sure that a given file remains Latin1 even if you add accented characters to it in the future, be sure to add some non-ASCII characters in it now (e.g., for text, underline the main heading with a line of ÷÷÷÷÷÷÷÷÷÷÷ American divided-by signs), then save it immediately with one of

        :x ++enc=latin1
or
        :setl fenc=latin1
        :w

Similarly for Windows-1252 or iso-8859-15, but use a different non-ASCII character, since both share most of their characters with Latin1 (Windows-1252 adds printable characters in the 0x80-0x9F range, and iso-8859-15 replaces a few Latin1 characters, such as the currency sign at 0xA4, which becomes the euro sign). On a side note, sometimes I notice that I send an email with headers declaring it to be 8bit utf-8 and that it comes back to me as 7bit us-ascii; the body, in that case, is byte-for-byte identical. (This one won't, because of the divided-by signs above. Maybe it'll come back as quoted-printable utf-8, or even as quoted-printable iso-8859-1.)

To convert a file from one encoding to another (e.g. Windows-1252 to UTF-8, and assuming that both can be represented in your present 'encoding'), it is extremely easy to do it with Vim (if has('iconv') returns 1 of course), as follows:

        :e ++enc=Windows-1252 filename
        :setl fenc=utf-8
        :w
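
If you would rather leave the original file untouched, one variant is to write the converted text under another name. A sketch, where both file names are placeholders:

```vim
" Read the file, telling Vim it is Windows-1252 on disk
:e ++enc=Windows-1252 oldname.txt
" Write a UTF-8 copy under a new name; oldname.txt is not modified
:w ++enc=utf-8 newname.txt
```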

You ask what it means to use ":setglobal fileencoding=utf-8". That tells Vim what 'fileencoding' value to use when you create a new file which didn't exist before. Or you could use ":setglobal fileencoding=Windows-1252" which will create files by default in Windows-1252 encoding, but of course in that case you will get a signal at write-time (and not before) if you write in the file something that has no representation in Windows-1252. See ":help local-options".
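
To make the global/local distinction concrete, here is a small sketch. The first command would normally live in the vimrc; the other two can be typed in any buffer to compare the values:

```vim
" Default 'fileencoding' for files that did not exist before
:setglobal fileencoding=utf-8

" The global value, used for new files
:setglobal fileencoding?
" The value for the buffer you are in right now
:setlocal fileencoding?
```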

++enc=something (before the filename in a file-read or file-write command such as :e or :saveas) tells Vim the 'fileencoding' to use for this read or write. When reading, it also sets 'fileencoding' (locally) for the file regardless of the 'fileencodings' heuristics. In spite of its name, this ++enc modifier has NOTHING TO DO with 'encoding' but only with 'fileencoding'.

'fileencodings' (plural) is a comma-separated list of values of 'fileencoding' (singular) to be tried when opening an editfile without the ++enc modifier. They are tested from left to right in sequence:

- ucs-bom (if present) should be first. It will test the first few bytes of the file against the possible representations of U+FEFF in the various Unicode encodings. If found, and the rest of the file agrees with that particular encoding, it will set 'bomb' to true and 'fileencoding' to the corresponding encoding. In that case the heuristic ends there. Otherwise 'bomb' is set to false and the next encoding is tried.

- Any multibyte encoding (for instance utf-8) tests the contents of the file against the admissible character values for that encoding. If an error is found, the test ends there (gives a "fail" result) and the next encoding in sequence is tested. If the end of the file is reached with no error (all bytes and byte sequences are acceptable for that encoding), 'fileencoding' is set and the heuristic ends.

- An 8-bit encoding can never fail: it will set 'fileencoding' with no test. IOW there should be at most one 8-bit encoding, and it should be last. If there is more than one 8-bit encoding, Vim won't give an error, it will just never try anything (not even a multibyte encoding, if present) after the first 8-bit encoding.

- The value "default" is special: it means the value from your OS locale, i.e. the value which 'encoding' had before sourcing any startup script, even the system vimrc. It may be useful to put it last if you don't already try an 8-bit encoding before that.
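
Putting the rules above together, one possible setting looks like this. It is only a sketch: the choice of latin1 as the single 8-bit fallback is an assumption, so pick whichever 8-bit encoding your own files mostly use.

```vim
" BOM check first, then the multibyte test, then one 8-bit fallback last
:set fileencodings=ucs-bom,utf-8,latin1

" After opening a file, see what the heuristic decided
:set fileencoding? bomb?
```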



Conclusion:

Vim has no built-in mechanism to sort Windows-1252, iso-8859-15 and Latin1 apart from each other. They are all 8-bit encodings, and sometimes one of the former two is used for the latter. You will have, for each of your files, to know which is which and, if necessary, use the appropriate ++enc modifier when reading it. This will set 'fileencoding' to what you tell Vim, and the same encoding will be used when writing. Just make sure that if you guess wrong, you notice it immediately, and read the file again in another 'fileencoding' before you modify it.
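
For instance, if a file was loaded as latin1 but the text looks wrong, re-reading it could go roughly as follows. Here cp1252 is just an example guess:

```vim
" Re-read the current file as Windows-1252 (while the buffer is still unmodified)
:e ++enc=cp1252
" Confirm what Vim now uses for this buffer
:setlocal fileencoding?
```

With no file name, :e re-edits the current file, so nothing else needs to be typed.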




Best regards,
Tony.
