See also http://vim.wikia.com/wiki/Working_with_Unicode

You seem to have read that article, which I wrote myself, so I'll try to explain the logic behind it in more detail (I hope not in boring detail). Be sure to check the Vim help for anything that is still unclear.


'encoding' is a global option determining how Vim represents characters in memory. The right place to set it is in your vimrc, BEFORE loading any editfile. Once you have started opening a file, changing 'encoding' makes the contents of ALL your current editfiles invalid, because it is not possible to convert all the contents of all your loaded buffers from one encoding to another as a result of your changing that option.
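
As a minimal sketch (assuming your Vim was compiled with +multi_byte, which I believe is the case for practically every current build), the start of a vimrc could look like this:
        " near the top of the vimrc, before any editfile is loaded
        if has('multi_byte')
          set encoding=utf-8
        endif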


The :scriptencoding ex-command (not mentioned in that wiki page) tells Vim to override 'encoding' for the purpose of reading the current script. For instance if your vimrc is encoded in Windows-1252 you can use
        scriptencoding Windows-1252
and any bytes between 0x80 and 0xFF in your script will be interpreted as in Windows-1252 even after you set 'encoding' to UTF-8.
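
The order matters here. As far as I know, the help for :scriptencoding says that it must come after the line which sets 'encoding', so a vimrc saved in Windows-1252 could be sketched like this (for this particular pair of encodings I am assuming that conversion is available, e.g. through iconv):
        set encoding=utf-8
        scriptencoding Windows-1252
        " from here on, bytes 0x80 to 0xFF in this script are read as Windows-1252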


'fileencoding' (singular) is a local option. It says how the file in question will be represented on disk. If 'encoding' is UTF-8 (recommended) and if your Vim can use iconv (i.e., has('iconv') returns 1, meaning that you either have +iconv linked in statically, or +iconv/dyn compiled in and the iconv or libiconv library found at runtime), then any encoding known to iconv can be translated to and from UTF-8, and Vim can do just that when reading and writing. But note that if 'encoding' is set to UTF-8 and you modify a file to put in it characters not acceptable for that file's 'fileencoding', Vim will give you no error signal as long as you don't save the file; so you can change the 'fileencoding' before or after you change the file contents: as long as they agree when you write the file, it's OK.
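
To see where you stand, something like the following can be typed at the command-line (note the quotes around the feature name; the answers will of course depend on your build and on the current file):
        :echo has('iconv')
        :set encoding?
        :setlocal fileencoding?
The first command answers 1 if iconv is usable, and the other two show the current values of both options.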

If the file contains only bytes less than 0x80, it will be interpreted identically in any of the following encodings (where those I'm writing on one line are synonyms, equivalent for Vim with iconv), and in a number of others:
- us-ascii
- latin1, iso-8859-1
- cp1252, Windows-1252
- latin9, iso-8859-15
- utf-8
so don't be afraid if Vim detects one of your Latin1 files (with no accented characters, French guillemets, etc.) as being UTF-8. In fact, with those contents, it could just as well be any of the encodings mentioned above (or a number of others). If you want to be sure that a given file remains Latin1 even if you add accented characters to it in the future, be sure to add some non-ASCII characters in it now (e.g., for text, underline the main heading with a line of ÷÷÷÷÷÷÷÷÷÷÷ American divided-by signs), then save it immediately with one of
        :x ++enc=latin1
or
        :setl fenc=latin1
        :w
Similarly for Windows-1252 or iso-8859-15, but use a different non-ASCII character, one which they don't share with Latin1 (the euro sign € for instance), since both coincide with Latin1 for most characters. On a side note, sometimes I notice that I send an email with headers declaring it to be 8bit utf-8 and that it comes back to me as 7bit us-ascii; the body, in that case, is byte-for-byte identical. (This one won't, because of the divided-by signs above. Maybe it'll come back as quoted-printable utf-8, or even as quoted-printable iso-8859-1.)

Converting a file from one encoding to another (e.g. Windows-1252 to UTF-8, and assuming that both can be represented in your present 'encoding') is extremely easy with Vim (if has('iconv') returns 1, of course), as follows:
        :e ++enc=Windows-1252 filename
        :setl fenc=utf-8
        :w

You ask what it means to use ":setglobal fileencoding=utf-8". That tells Vim what 'fileencoding' value to use when you create a new file which didn't exist before. Or you could use ":setglobal fileencoding=Windows-1252" which will create files by default in Windows-1252 encoding, but of course in that case you will get a signal at write-time (and not before) if you write in the file something that has no representation in Windows-1252. See ":help local-options".
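
As a sketch of the difference between the global and the local value, assuming you put this line in your vimrc:
        setglobal fileencoding=utf-8
then later, at any time, you can compare the default for new files with the value of the current buffer:
        :setglobal fileencoding?
        :setlocal fileencoding?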


++enc=something (before the filename in a file-read or file-write command such as :e or :saveas) tells Vim the 'fileencoding' to use for this read or write. When reading, it also sets 'fileencoding' (locally) for the file regardless of the 'fileencodings' heuristics. In spite of its name, this ++enc modifier has NOTHING TO DO with 'encoding' but only with 'fileencoding'.


'fileencodings' (plural) is a comma-separated list of values of 'fileencoding' (singular) to be tried when opening an editfile without the ++enc modifier. They are tested from left to right in sequence:

- ucs-bom (if present) should be first. It will test the first few bytes of the file against the possible representations of U+FEFF in the various Unicode encodings. If found, and the rest of the file agrees with that particular encoding, it will set 'bomb' to true and 'fileencoding' to the corresponding encoding. In that case the heuristics end there. Otherwise 'bomb' is set to false and the next encoding is tried.

- Any multibyte encoding (for instance utf-8) tests the contents of the file against the admissible character values for that encoding. If an error is found, the test ends there (gives a "fail" result) and the next encoding in sequence is tested. If the end of the file is reached with no error (all bytes and byte sequences are acceptable for that encoding), 'fileencoding' is set and the heuristics end.

- An 8-bit encoding can never fail: it will set 'fileencoding' with no test. IOW there should be at most one 8-bit encoding, and it should be last. If there is more than one 8-bit encoding, Vim won't give an error, it will just never try anything (not even a multibyte encoding, if present) after the first 8-bit encoding.

- The value "default" is special: it means the value from your OS locale, i.e. the value which 'encoding' had before sourcing any startup script, even the system vimrc. It may be useful to put it last if you don't already try an 8-bit encoding before that.
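
Putting these rules together, one possible setting (an example only; replace latin1 by whichever 8-bit encoding is most common among your files) would be:
        set fileencodings=ucs-bom,utf-8,latin1
and after opening a file you can check what the heuristics decided with
        :setlocal fileencoding? bomb?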


Conclusion:
Vim has no built-in mechanism to sort Windows-1252, iso-8859-15 and Latin1 apart from each other. They are all 8-bit encodings, and sometimes one of the former two is used for the latter. You will have to know, for each of your files, which is which and, if necessary, use the appropriate ++enc modifier when reading it. This will set 'fileencoding' to what you tell Vim, and the same encoding will be used when writing. Just make sure that if you guess wrong, you notice it immediately, and read the file again with another 'fileencoding' before you modify it.
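
For instance, assuming the buffer has not been modified yet, a file which came up as Latin1 but which you believe to be Windows-1252 can be read again with
        :e ++enc=Windows-1252
        :setlocal fileencoding?
(":e" with no filename re-edits the current file; the second line just shows the value which was set.)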



Best regards,
Tony.
