Here's a copy of my report of this problem (mostly for my own use).

UTF-8 problem.  The file you are now reading is a Unicode (UTF-8) file.
If it looks corrupted in vim, use :set encoding=utf-8.  Most Russian
websites use UTF-8, in which each Cyrillic letter is encoded as 2 bytes.
For the 2nd byte, р-я maps to 80-8F hex and А-П maps to 90-9F.  Due to a
bug, lynx displays these bytes as upper control codes.  Examples: р=80=~@,
с=81=~A, т=82=~B, Б=91=~Q, П=9F=~_, etc.  So the resulting display is
corrupted: about half of the Russian characters get erroneously converted
to the (ascii) representation of upper control characters, where ~A is
like ^A (control-A) but with the high order bit set.
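
As a cross-check of the mapping above, here is a minimal Python sketch.
It takes the letters р, с, т, Б, П (2nd bytes 80, 81, 82, 91, 9F) and
derives the ~X notation by subtracting 40 hex from the byte, the same
way ^X is formed for 00-1F.  The function names are my own and only
illustrate the arithmetic; they say nothing about how lynx does it.

```python
def upper_control_notation(byte):
    """Return '~X' for a byte in 0x80-0x9F, mirroring '^X' for 0x00-0x1F."""
    if not 0x80 <= byte <= 0x9F:
        raise ValueError("not an upper control code")
    return "~" + chr(byte - 0x40)


def second_byte_notation(letter):
    """Return (hex bytes, ~X notation) for a 2-byte UTF-8 letter."""
    first, second = letter.encode("utf-8")
    return f"{first:02X} {second:02X}", upper_control_notation(second)


# Illustrative letters only: their 2nd bytes fall in the 80-9F range.
examples = {letter: second_byte_notation(letter) for letter in "рстБП"}

for letter, (hex_bytes, notation) in examples.items():
    print(letter, hex_bytes, notation)
```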

The above happens even though lynx has UTF-8 set, which can be inspected
(or set) in the options menu by pressing "o" in lynx.  So the 2nd
byte of about half of the Russian characters gets converted to an upper
control code such as ~A.  What happens to the 1st utf-8 byte of the
character?  It gets converted to an error symbol.  This first byte is
actually either D0 (for А-П) or D1 (for р-я).

What about the other half of the Russian characters that lynx displays
OK?  They all have D0 as the first byte and then Ax or Bx as the 2nd
byte (x=0-F).  Why do these come thru OK?  Well, they are not upper
control characters but they are above-ascii bytes with the high order
bit set (80-FF).  So some above-ascii bytes, A0-BF, get displayed OK.
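
The half-and-half split can be checked with a short Python sketch.  It
assumes the 64 basic Russian letters are the code points U+0410 through
U+044F (the letter e with 2 dots is outside that range) and sorts them
by the 2nd UTF-8 byte.  The variable names are just for illustration.

```python
# The 64 basic Russian letters occupy U+0410..U+044F.
letters = [chr(cp) for cp in range(0x0410, 0x0450)]


def second_byte(letter):
    """Return the 2nd byte of the letter's UTF-8 encoding."""
    return letter.encode("utf-8")[1]


# 2nd byte in 80-9F: shown as an upper control code such as ~A.
garbled = [c for c in letters if 0x80 <= second_byte(c) <= 0x9F]
# 2nd byte in A0-BF: displayed OK.
displayed_ok = [c for c in letters if 0xA0 <= second_byte(c) <= 0xBF]

print(len(garbled), "".join(garbled))
print(len(displayed_ok), "".join(displayed_ok))
```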

UTF-8 FILES NOT CORRUPTED
One can use lynx to save an internet UTF-8 page to a file and the
file is valid UTF-8, which displays OK on a UTF-8 terminal using
cat, more, or vim (using :set encoding=utf-8).  But the above
mentioned corruption (~A and error characters) happens if one tries to
view it with lynx or with vim (default encoding of 8859-15).  Viewing it
with most shows the same sort of corruption but instead of ~A one sees
<81> and instead of ~_ one sees <9F> etc. (the upper control codes get
displayed as ascii hex-numbers inside angle brackets).
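
To show that the bytes themselves are fine and only the interpretation
differs, here is a rough Python sketch.  It reads valid UTF-8 bytes one
at a time as ISO-8859-15 and prints bytes in 80-9F as hex in angle
brackets.  The function render_like_pager is hypothetical: it only
imitates the <81> style described above and is not taken from most,
vim, or lynx.

```python
def render_like_pager(data):
    """Render bytes one at a time as 8859-15, with 80-9F shown as <XX>."""
    out = []
    for b in data:
        if 0x80 <= b <= 0x9F:
            out.append(f"<{b:02X}>")  # upper control code, shown as hex
        else:
            out.append(bytes([b]).decode("iso8859_15"))
    return "".join(out)


raw = "Б".encode("utf-8")  # valid UTF-8: the two bytes D0 91

print(raw.decode("utf-8"))     # read as UTF-8: one Cyrillic letter
print(render_like_pager(raw))  # read bytewise as 8859-15: garbled
```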

OTHER CHARACTERS USED IN RUSSIAN
The above was only for alpha characters and ignores other characters
used in Russian, such as the French-style quotes or the letter e with
2 dots over it (the only alpha character not covered above).  In
unicode utf-8, they are not contiguous with the Russian codes mentioned
above since they are apparently also used in other languages.  I can't
seem to type them into this doc, but that's another problem since
Windows is supposed to generate them for my dumb terminal (PuTTY
emulation of UTF-8 on a Windows laptop).  As far as displaying such
other characters correctly, fixing the ~A problem first may help.

TOOLS USED FOR INVESTIGATING
Besides using cat, more, most, and vim to inspect files, one may use hd
(hex dump) to look at the hex contents alongside the UTF-8 glyphs.  It
doesn't seem to parse unicode byte pairs correctly when the pair is
split between two lines of display.  But if one ignores bytes at the
end or beginning of lines, it's useful.
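
If the split pairs get in the way, a small script can do a similar dump
without breaking a character across lines.  This is only a sketch of
the idea, not a replacement for hd: the name utf8_dump and its width
parameter are made up, and it starts a new line early whenever the next
character would not fit whole.

```python
def utf8_dump(text, width=16):
    """Return dump lines of at most `width` bytes, never splitting a character."""
    lines = []
    row_hex, row_chars, used = [], [], 0

    def flush():
        hex_part = " ".join(row_hex)
        # Pad the hex column so the glyph column lines up on every row.
        lines.append(hex_part.ljust(width * 3) + "".join(row_chars))

    for ch in text:
        encoded = ch.encode("utf-8")
        # Start a new row if this whole character would not fit.
        if row_hex and used + len(encoded) > width:
            flush()
            row_hex, row_chars, used = [], [], 0
        row_hex.extend(f"{b:02x}" for b in encoded)
        row_chars.append(ch if ch.isprintable() else ".")
        used += len(encoded)
    if row_hex:
        flush()
    return lines


for line in utf8_dump("привет", width=5):
    print(line)
```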

                        David Lawyer



