Here's a copy of my report of this problem (mostly for my own use).

UTF-8 problem.

The file you are now reading is a Unicode (UTF-8) file. If it looks corrupted in vim, use :set encoding=utf-8.

Most Russian websites use UTF-8, in which each Cyrillic letter takes 2 bytes. For the 2nd byte, р-я maps to 80-8F hex and А-П maps to 90-9F. Due to a bug, lynx displays these bytes as upper control codes. Examples: р=80=~@, с=81=~A, т=82=~B, Б=91=~Q, П=9F=~_, etc. So the resulting display is corrupted: about half of the Russian characters get erroneously converted to an upper control character representation (in ASCII), where ~A is like ^A (control-A) but with the high-order bit set.

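The byte arithmetic above can be checked with a short Python sketch. It assumes that the ~X notation is just the usual caret notation for control codes with the high-order bit added (so ~A stands for byte 81 hex), and it walks the 64 letters U+0410 through U+044F. The names tilde_notation and second_byte_table are made up for illustration; nothing here is taken from lynx itself.

```python
def tilde_notation(byte: int) -> str:
    """Render a byte in 0x80-0x9F as '~' plus its control letter.

    Hypothetical helper: ^A is 0x01, so ~A (same code with the
    high-order bit set) is 0x81, and the letter is chr(byte - 0x40).
    """
    if not 0x80 <= byte <= 0x9F:
        raise ValueError("not an upper control code")
    return "~" + chr(byte - 0x40)


def second_byte_table():
    """List (letter, 1st byte, 2nd byte, display) for U+0410..U+044F."""
    rows = []
    for cp in range(0x0410, 0x0450):
        ch = chr(cp)
        first, second = ch.encode("utf-8")  # always exactly 2 bytes here
        # A 2nd byte of 0x80-0x9F is an upper control code; A0-BF is not.
        shown = tilde_notation(second) if second <= 0x9F else "ok"
        rows.append((ch, first, second, shown))
    return rows


if __name__ == "__main__":
    for ch, first, second, shown in second_byte_table():
        print(f"{ch}  {first:02X} {second:02X}  {shown}")
```

Run as a script, this prints one line per letter. 32 of the 64 letters get a ~ code, which agrees with the "about half" that show up corrupted.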
The above happens even though lynx has UTF-8 set, which can be inspected (or set) from the options menu by pressing "o" in lynx. So the 2nd byte of about half of the Russian characters gets converted to an upper control code such as ~A. What happens to the 1st UTF-8 byte of the character? It gets converted to an error symbol. This first byte is actually either D0 (for А-П) or D1 (for р-я).

What about the other half of the Russian characters, which lynx displays OK? They all have D0 as the first byte and then Ax or Bx as the 2nd byte (x = 0-F). Why do these come through OK? Their 2nd bytes are not upper control characters, although they are still above-ASCII bytes with the high-order bit set (80-FF). So some above-ASCII bytes, A0-BF, get displayed OK.

UTF-8 FILES NOT CORRUPTED

One can use lynx to save an internet UTF-8 page to a file, and the file is valid UTF-8: it displays OK on a UTF-8 terminal using cat, more, or vim (using :set encoding=utf-8). But the above-mentioned corruption (~A and error characters) happens if one tries to view it with lynx or with vim (default encoding of 8859-15). Viewing it with most shows the same sort of corruption, but instead of ~A one sees <81>, instead of ~_ one sees <9F>, etc. (the upper control codes get displayed as ASCII hex numbers inside angle brackets).

OTHER CHARACTERS USED IN RUSSIAN

The above covers only alphabetic characters and ignores other characters used in Russian, such as the French-type quotes or the letter e with 2 dots over it (the only alphabetic character not covered above). In Unicode UTF-8 they are not contiguous with the Russian codes mentioned above, since they are apparently also used in other languages. I can't seem to type them into this doc, but that's another problem, since Windows is supposed to generate them for my dumb terminal (PuTTY emulation of UTF-8 on a Windows laptop). As far as displaying such other characters correctly, fixing the ~A problem first may help.

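To see where these other characters land, here is a small Python sketch that prints their UTF-8 bytes and flags any byte in the 80-9F range. It assumes the French-type quotes are the guillemets U+00AB and U+00BB and that the dotted e is U+0401 / U+0451. The helper names are hypothetical, and the code says nothing about what lynx actually does with these bytes.

```python
# Assumed code points (not taken from the report itself):
OTHER_CHARS = {
    "capital e with two dots": "\u0401",
    "small e with two dots": "\u0451",
    "left French quote": "\u00ab",
    "right French quote": "\u00bb",
}


def utf8_hex(ch: str) -> str:
    """Return the UTF-8 bytes of ch as space-separated hex."""
    return " ".join(f"{b:02X}" for b in ch.encode("utf-8"))


def has_upper_control_byte(ch: str) -> bool:
    """True if any UTF-8 byte of ch lies in 0x80-0x9F."""
    return any(0x80 <= b <= 0x9F for b in ch.encode("utf-8"))


if __name__ == "__main__":
    for name, ch in OTHER_CHARS.items():
        flag = "upper control byte" if has_upper_control_byte(ch) else "no 80-9F byte"
        print(f"{name:26} {utf8_hex(ch):6} {flag}")
```

Under those assumptions, both forms of the dotted e have a 2nd byte in 80-9F, the same range as the letters that get corrupted, while the two quotes have a 2nd byte in A0-BF.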
TOOLS USED FOR INVESTIGATING

Besides using cat, more, most, and vim to inspect files, one may use hd (hex dump) to look at the hex contents alongside the UTF-8 glyphs. It doesn't seem to parse Unicode byte pairs correctly when the pair is split between two lines of display. But if one ignores bytes at the end or beginning of lines, it's useful.

David Lawyer