On 08/12/2020 17.47, Bram Moolenaar wrote:
>> This works:
>> :set fencs=utf8
>> :%!cat
>> although "fenc" remains "latin1".
>
> Yeah, for an existing buffer and filtering the first entry in 'fencs' is
> used to read the filter output, but 'fenc' isn't set.  That's a bit
> strange, but I'm not sure what would break if we change this.  It might
> actually be good to fix this, since if you write that file it might get
> messed up.

I performed a couple of tests, trying to write the result to a file after doing the above (using a correct UTF-8 file as source):

- if you leave fenc set to latin1, the new file will be in latin1 (with all the characters correctly encoded)
- if you set fenc to utf8 *after* the %!cat (but of course before writing the file), the new file will be in UTF-8 with all the characters correctly encoded
- if you set fenc to utf8 *before* the %!cat (and of course before writing the file), the new file will be... a mess: by all appearances Vim thinks that the individual bytes of the UTF-8 file are individual latin1 characters, and it then converts them to UTF-8, so you get a UTF-8 encoded file with the wrong characters

For example, a "C3 B2" sequence in the original file, which stands for a UTF-8 encoded "ò" (Unicode code point F2), becomes a "C3 83 C2 B2" sequence in the written file: "C3" is an "Ã" in latin1 (and yes, in Unicode too), and "Ã" is encoded as "C3 83" in UTF-8; "B2" is a "²" in latin1 (and Unicode), and "²" is encoded as "C2 B2" in UTF-8. (In case someone noticed it, don't let yourself get confused by the fact that C3 and B2 occur in both the source and the translated sequence; that's largely just an unfortunate coincidence of my example.)
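The byte sequence seen in the third test can be reproduced outside Vim. What follows is only a minimal Python sketch of the suspected mechanism (UTF-8 bytes read as latin1 characters, then encoded again as UTF-8); it does not show what Vim does internally, and the function name misread_and_reencode is made up for this illustration.

```python
def misread_and_reencode(text: str, wrong_charset: str) -> bytes:
    """Encode text as UTF-8, misread those bytes as wrong_charset,
    then encode the misread characters as UTF-8 again."""
    utf8_bytes = text.encode("utf-8")           # "ò" -> C3 B2
    misread = utf8_bytes.decode(wrong_charset)  # each byte becomes one character
    return misread.encode("utf-8")              # each character is re-encoded


result = misread_and_reencode("ò", "latin-1")
print(result.hex(" ").upper())  # C3 83 C2 B2
```

The output matches the "C3 83 C2 B2" sequence found in the written file, which is consistent with the double-encoding explanation.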

Given that Unicode is identical to latin1 in the first 256 characters, to better confirm what happened I also repeated the above tests with another charset (cp850) instead of latin1 (fencs=cp850 in my vimrc, and fenc=cp850 in the second and third tests), still using a correct UTF-8 file as the source. The results are analogous: a correct cp850 file in the first test, a correct UTF-8 one in the second, and in the third a UTF-8 one with the original file's bytes interpreted as cp850 and then converted to UTF-8. The original "ò", "C3 B2", becomes an "E2 94 9C E2 96 93" sequence, given that "C3" is a "├" symbol in cp850 (Unicode code point 251C -> "E2 94 9C" in UTF-8) and "B2" is a "▓" (Unicode code point 2593 -> "E2 96 93" in UTF-8).
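The per-byte mapping for the cp850 case can be checked outside Vim as well. This is a small Python sketch that only uses the standard cp850 codec to trace each byte of the UTF-8 encoded "ò"; the helper name trace_bytes is hypothetical, and the sketch says nothing about Vim's own conversion code.

```python
def trace_bytes(text: str, charset: str) -> list[tuple[str, str, str]]:
    """For each UTF-8 byte of text, return (byte, code point, UTF-8 bytes)
    of the character that byte represents when read as charset."""
    rows = []
    for byte in text.encode("utf-8"):
        char = bytes([byte]).decode(charset)         # read one byte as one character
        utf8_hex = char.encode("utf-8").hex(" ").upper()
        rows.append((f"{byte:02X}", f"U+{ord(char):04X}", utf8_hex))
    return rows


for source_byte, code_point, utf8_hex in trace_bytes("ò", "cp850"):
    print(f"{source_byte} -> {code_point} -> {utf8_hex}")
# C3 -> U+251C -> E2 94 9C
# B2 -> U+2593 -> E2 96 93
```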

Yes, I... ahem, had a lot of fun this afternoon :D


Cheers
