>> Really. I'm not making this up. :-/
>
>No, I don't think you are. I think that line in both files is correctly
>UTF-8 encoded.

And now that you've explained what's going on, it's clear that you're right.

>vim isn't the vi(1) I grew up with, and probably you too.

Definitely. The first time I used vi was in 1984, on a 68000-based Cadmus
system.

>Try ‘:se fileencoding?’ when vim-ing good and again with bad.

Good point:

   $ vim good
   :set fileencoding
     fileencoding=utf-8

   $ vim bad
   :set fileencoding
     fileencoding=latin1

>I expect the bad file has something earlier on which fixes vim's idea of
>the encoding to ISO 8859-1

That does seem to be the case. Do you have any idea what kind of thing
that might be? (I know you can't diagnose a file you haven't seen, but in
general, what sorts of things should I look for?)

>> But wait. It gets worse:
>>
>>    $ grep -n ^Veuillez good | cut -c1-68
>>    108:Veuillez ne pas répondre au présent courriel. Il a été gén�
>>
>>    $ grep -n ^Veuillez bad | cut -c1-68
>>    108:Veuillez ne pas répondre au présent courriel. Il a été gén�
>
>The worse being it is the very same line 108 you're seeing in vim which
>grep is also showing?

Exactly, because...

>(The ‘�’ at the end is to be expected.)

...this is still more evidence that you know more about character sets and
conversions than I do. As if further evidence was needed at this point. :-/

Until now, I've only ever seen that glyph when a character doesn't exist in
the font being used -- but that can't be the case here because that same
character is shown correctly five times in the same line of output.

Why is it to be expected?

>>    $ LC_ALL=C perl -lpe 's/[^ -~]/sprintf "<%02x>", ord($&)/ge' good_snippet
>>    [...]
>
>I don't understand that. The -p sets up a loop to read a line from
>good_snippet, do the substitution on it, and print the result, until
>EOF. The -l strips off the linefeed on input and puts it back on the
>output. The substitution in between changes all bytes, thanks to
>LC_ALL=C, which aren't space to tilde into a ‘<42>’ string representing
>their hex value.

Thank you for explaining that.

Just for fun, I tried the following in tcsh:

   $ setenv LC_ALL C
   $ perl -lpe 's/[^ -~]/sprintf "<%02x>", ord($&)/ge' good_snippet
   Veuillez ne pas r<c3><a9>pondre au pr<c3><a9>sent courriel. Il a <c3><a9>t<c3><a9> g<c3><a9>n<c3><a9>r<c3><a9>

As expected, this returned pretty much instantly. Then I tried this:

   $ sh
   $ LC_ALL=C
   $ echo $LC_ALL
   C
   $ perl -lpe 's/[^ -~]/sprintf "<%02x>", ord($&)/ge' good_snippet

...and that also hung. Which in a way is good, because at least it means
bash is behaving consistently. But also not good, because it's behaving
badly. :-/

On my system, /bin/sh is a symlink to /bin/bash, which is version
5.1.016-2 as packaged by Manjaro.

...but troubleshooting bash is far outside the scope of this discussion, so
I propose to forget this particular clupea harengus of the crimson variety
unless you find it interesting in and of itself.

>Nothing wrong with od(1). If you have hexdump(1) installed then it with
>-C gives quite nice output.

Yes, I see (or -C? :-). Thanks for that tip; I hadn't known that hexdump
existed.

>> ...and both snippets are identical!
>
>Well, those lines were identical to start with before snipping.
>You could confirm this with
>
>    cmp <(sed -n 108p good) <(sed -n 108p bad)

As written, this also hangs in bash (and is invalid syntax in tcsh). But
it's effectively equivalent to

   $ sed -n 108p good > good.sed
   $ sed -n 108p bad > bad.sed
   $ cmp good.sed bad.sed
   $ echo $?
   0

...which behaves as expected.

>> Strangely, both snippet files look fine in vim.
>
>Because you have chopped off the non-UTF-8 which occurs earlier in bad
>which fixes vim's idea of the file's encoding.

In retrospect this should have been obvious. :-/

>> ...but for the bad file, that becomes
>>
>>    "bad" [converted] 336 lines, 49471 bytes                 1,1   Top
>
>Ta-da!

Indeed. :-) Thank you.

     - Steven
-- 
___________________________________________________________________________
Steven Winikoff   | Montreal, QC, Canada | Eschew obfuscation.
s...@smwonline.ca | http://smwonline.ca  |