On Monday, June 11, 2001 4:14 AM, Vadim Snurnikov wrote:
> How can I read a text in Unicode (Russian) where every Russian letter
> is represented like that: D=B6 (or similar)? Unfortunately, all these
> four characters that stand for one Russian letter are of one byte each,
> so that I am getting 4 bytes for every Russian letter. (The e-mail got
> transferred to this format.)
The format of an e-mail message should be described in its MIME headers.
E.g., in a message containing the headers
    MIME-Version: 1.0
    Content-Type: text/plain; charset=UTF-8
    Content-Transfer-Encoding: quoted-printable
the text is encoded twice: the Unicode characters are first encoded
in UTF-8 (which has 8-bit code units), and then the result is
encoded in MIME quoted-printable (which uses only 7-bit ASCII characters).
Thus, the Russian word "Ya " (meaning "I ") ends up in 7 bytes, viz.
"=D0=AF ".
To interpret this, you have to undo the two encodings, in reverse order:
1. Undo the quoted-printable; from the above example, this will yield
three bytes, viz. D0 AF 20.
2. Undo the UTF-8 encoding; in the example, you'll get U+042F U+0020
(Cyrillic Capital Ya, Space).
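
The two steps above can be sketched in Python with the standard library
only. This is a minimal illustration, not a general mail decoder: the
function name decode_qp_text and the sample phrase are my own choices,
and the charset is assumed to be known from the Content-Type header.

```python
import quopri

def decode_qp_text(data: bytes, charset: str = "utf-8") -> str:
    """Undo quoted-printable first, then the character encoding."""
    raw = quopri.decodestring(data)   # step 1: "=D0=AF" -> bytes D0 AF
    return raw.decode(charset)        # step 2: bytes D0 AF -> U+042F

# Hypothetical sample: three Cyrillic words separated by spaces.
sample = b"=D0=AF =D0=B8 =D1=82=D1=8B"
text = decode_qp_text(sample)
print([f"U+{ord(ch):04X}" for ch in text])
```

For a message in another charset, pass its name (e.g. "koi8-r") as the
second argument; step 1 stays the same.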
Depending on the encodings chosen, the details may vary, of course.
Note that Cyrillic has several popular encodings (in addition to Unicode),
cf. <http://czyborra.com/charsets/cyrillic.html>. All of these would look
superficially alike in MIME quoted-printable encoding.
The "=" is characteristic of quoted-printable. However, UTF-8
followed by quoted-printable yields 6 bytes for every Cyrillic
character (as "=D0=AF" in the example given) and 1 byte for every
ASCII character (as the space in the example given), whereas the
single-byte Cyrillic encodings mentioned above, followed by
quoted-printable, yield 3 bytes for each Cyrillic character (and again
1 byte for each ASCII one).
I am not aware of any e-mail encoding scheme that will encode Cyrillic
characters in four bytes each, such as the "D=B6" sequence originally
quoted.
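
The byte counts can be checked with a short Python sketch. The helper
name qp_length is hypothetical, and the three single-byte charsets
listed are merely examples of the popular Cyrillic encodings, picked
by me for illustration.

```python
import quopri

def qp_length(char: str, charset: str) -> int:
    """Number of bytes one character occupies after charset encoding
    followed by quoted-printable encoding."""
    raw = char.encode(charset)
    return len(quopri.encodestring(raw))

ya = "\u042f"  # Cyrillic Capital Ya
for charset in ("utf-8", "koi8-r", "windows-1251", "iso-8859-5"):
    encoded = quopri.encodestring(ya.encode(charset))
    print(charset, encoded, qp_length(ya, charset))
```

None of these combinations gives four bytes per letter, which is why
the "D=B6" pattern does not match any of them.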
> Is there a tool to transfer this back into 2-byte-encoding or to any
> other readable form?
Every decent, contemporary e-mail client should do this automagically,
provided the headers have not been removed and the mail has not been
otherwise distorted. (Deplorably, all WWW-mail servers I have tested so
far remove the MIME headers before they have properly undone both
encodings, thus corrupting the message in one way or another.)
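
As a rough illustration of why the headers matter, here is how a
program can let the MIME headers drive both decoding steps, using
Python's standard email package. The message below is a made-up sample
built around the headers shown earlier; it is not the original mail.

```python
from email import message_from_bytes
from email.policy import default

# Hypothetical raw message, as it would arrive from a POP3/IMAP server.
raw_message = (
    b"MIME-Version: 1.0\r\n"
    b"Content-Type: text/plain; charset=UTF-8\r\n"
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b"=D0=AF =D0=B8 =D1=82=D1=8B\r\n"
)

msg = message_from_bytes(raw_message, policy=default)

# The library reads both headers: it undoes the transfer encoding,
# then decodes the bytes using the declared charset.
body = msg.get_content()
print(msg["Content-Transfer-Encoding"], msg.get_content_charset())
print(body)
```

If the headers are stripped before decoding, the program has no way to
know which charset or transfer encoding was used, and can only guess.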
So my advice is:
- install Unicode fonts, comprising at least the WGL4 repertoire
(cf. <http://www.hclrss.demon.co.uk/unicode/fonts.html#wgl4>),
- collect your mail from a POP3 or IMAP server (not from an HTTP
server via some mail-WWW interface),
- use the current version of your favourite e-mail client.
I have tested
· Messenger from Netscape 6.0, and it does it right,
though it exhibited some teething troubles;
· Eudora 5.1, which cannot even display Cyrillic text pasted from
the Windows 98 clipboard; its documentation says nothing whatsoever
about UTF-8, Unicode, or charsets, and its menus do not mention
character encoding. Hence, it probably does not interpret UTF-8
encoded messages either (which I cannot test tonight).
I have not yet tested:
· Outlook from Internet Explorer 5 (which is promising, as its
browser has the most thorough UTF-8 support I have seen so far).
Best wishes,
Otto Stolz