Re: FW: Russian Unicode Convertion

Otto Stolz Tue, 12 Jun 2001 14:47:00 -0700
On Monday, June 11, 2001 4:14 AM, Vadim Snurnikov wrote:
> How can I read a text in Unicode (Russian) where every Russian letter
> is represented like that: D=B6 (or similar)?  Unfortunately, all these
> four characters that stand for one Russian letter are of one byte each,
> so that I am getting 4 bytes for every Russian letter.  (The e-mail got
> transferred to this format.)

The format of an e-mail message should be described in its MIME headers.
E.g., in a message containing the headers
  MIME-Version: 1.0
  Content-Type: text/plain; charset=UTF-8
  Content-Transfer-Encoding: quoted-printable
the text is encoded twice: the Unicode characters are first encoded
in UTF-8 (which has 8-bit code units), and then the result is
encoded in MIME quoted-printable (which uses only 7-bit characters).
Thus, the Russian word "Ya " (meaning "I ") ends up in 7 bytes, viz.
"=D0=AF ".

To interpret this, you have to undo the two encodings, in reverse order:
1. Undo the quoted-printable; from the above example, this will yield
   three bytes, viz. D0 AF 20.
2. Undo the UTF-8 encoding; in the example, you'll get U+042F U+0020
   (Cyrillic Capital Ya, Space).

Depending on the encodings chosen, the details may vary, of course.
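
The two steps above can be sketched in Python. This is only an
illustration, assuming the standard "quopri" module; the function name
decode_qp_utf8 is made up for this example, and a different charset
would need a different second step.

```python
import quopri


def decode_qp_utf8(raw: bytes) -> str:
    """Undo the two encodings in reverse order: quoted-printable, then UTF-8."""
    # Step 1: undo quoted-printable, e.g. b"=D0=AF" becomes the bytes D0 AF.
    utf8_bytes = quopri.decodestring(raw)
    # Step 2: undo UTF-8, e.g. the bytes D0 AF become U+042F.
    return utf8_bytes.decode("utf-8")


if __name__ == "__main__":
    text = decode_qp_utf8(b"=D0=AF")
    # List the resulting code points in U+XXXX notation.
    print([f"U+{ord(ch):04X}" for ch in text])
```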

Note that Cyrillic has several popular encodings (in addition to Unicode),
cf. <http://czyborra.com/charsets/cyrillic.html>. All of these would look
superficially alike in MIME quoted-printable encoding.

The "=" is characteristic of quoted-printable. However, UTF-8
followed by quoted-printable will yield 6 bytes for every Cyrillic
character (as "=D0=AF", in the example given), and 1 byte for every
ASCII character (as the space in the example given). The older 8-bit
Cyrillic encodings (cf. supra) followed by quoted-printable will encode
Cyrillic characters in 3 bytes each (and again ASCII ones in 1 byte each).
I am not aware of any e-mail encoding scheme that will encode Cyrillic
characters in four bytes each, such as the "D=B6" sequence originally
quoted.
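
These byte counts are easy to check. The following Python sketch is an
assumption-laden illustration, not a survey of all Cyrillic charsets: it
picks KOI8-R, Windows-1251 and ISO 8859-5 as examples of the older 8-bit
encodings, and the helper name qp_length is hypothetical.

```python
import quopri


def qp_length(ch: str, charset: str) -> int:
    """Bytes needed for one character after charset encoding plus quoted-printable."""
    return len(quopri.encodestring(ch.encode(charset)))


if __name__ == "__main__":
    ya = "\u042f"  # Cyrillic Capital Ya
    # The charsets below are example choices, not an exhaustive list.
    for charset in ("utf-8", "koi8-r", "cp1251", "iso-8859-5"):
        print(charset, qp_length(ya, charset))
    # An ASCII letter passes through quoted-printable unchanged.
    print("ascii letter", qp_length("a", "utf-8"))
```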

> Is there a tool to transfer this back into 2-byte-encoding or to any
> other readable form?

Every decent, contemporary e-mail client should do this automagically,
provided the headers have not been removed and the mail has not been
otherwise distorted. (Deplorably, all WWW-mail servers I have tested so
far remove the MIME headers before they have properly undone both
encodings, thus corrupting the message in one way or the other.)
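
If the message is still intact, headers included, a script can also do
the job. Here is a minimal sketch, assuming Python's standard "email"
package; the sample message RAW and the function name read_body are
invented for illustration. The library reads the charset and the
transfer encoding from the MIME headers, so it only works while those
headers are present.

```python
from email import message_from_bytes, policy

# Hypothetical sample message with the headers discussed above.
RAW = (
    b"MIME-Version: 1.0\r\n"
    b"Content-Type: text/plain; charset=UTF-8\r\n"
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b"=D0=AF\r\n"
)


def read_body(raw: bytes) -> str:
    """Parse a complete message and return its text with both encodings undone."""
    msg = message_from_bytes(raw, policy=policy.default)
    # get_content() undoes the transfer encoding, then decodes the charset
    # named in the Content-Type header.
    return msg.get_content()


if __name__ == "__main__":
    print(read_body(RAW).strip())
```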

So my advice is:
- install Unicode fonts, comprising at least the WGL4 repertoire
  (cf. <http://www.hclrss.demon.co.uk/unicode/fonts.html#wgl4>),
- collect your mail from a POP3, or an IMAP, server (not from an HTTP
  server via some mail-WWW interface),
- use the current version of your favourite e-mail client.
  I have tested
  · Messenger from Netscape 6.0, and it does it right,
    though it exhibited some teething troubles;
  · Eudora 5.1, which is not even capable of displaying Cyrillic text
    from the Windows 98 clipboard; its documentation says nothing
    whatsoever about UTF-8, Unicode, or charsets, and its menus do not
    mention character encoding. Hence, it probably does not interpret
    UTF-8 encoded messages either (which I cannot test tonight).
  I have not yet tested:
  · Outlook Express from Internet Explorer 5 (which is promising, as
    its browser has the most thorough UTF-8 support I have seen so far).

Best wishes,
  Otto Stolz