Re: [PHP-I18N] Re: Problems with mime encoding of Japanese Characters in Subject

Dietrich Bollmann Thu, 08 May 2008 22:48:51 -0700

On Fri, 2008-05-09 at 12:56 +0900, Darren Cook wrote: 
> (This is a reply to a problem in the archives, from March:
>    http://marc.info/?l=php-i18n&m=120595161128203&w=2 )


Hi Darren,

Thanks for your answer!

I actually planned to write a more detailed answer - but then couldn't
find the time to finish it and finally lost everything I had prepared 
because of a HD problem :(

I often get email with mojibake (scrambled subjects etc.) here in Japan
and - after looking at other people's code - had the impression that
most people handle this kind of problem by trial and error until their
own problem is solved, rather than by reading the official encoding
standards.  The code used by ec-cube, for example, seems to suffer
from the same problem I encountered.

The result is buggy email programs everywhere, which often do not work
with anything other than JIS, SJIS and EUC.  And the only way to deal
with their "personal" approach to character encoding therefore seems to
be to fall back on the same strategy: trial and error.

Not very satisfying. But I finally did the same thing as this seemed
to be the only way to make my problem vanish.  (See my hackish code
appended.)

The answers I got from other people when I asked them why they do not
use UTF-8 probably reflect this kind of experience, and the beliefs it
has caused, rather than the facts about encoding methods
and standards...

Here is a list of some of the answers I got (as I got them):

- Most cell phones don't work with UTF-8.

...this is the single most important answer I got, as lots of
people in Japan use the time they spend on the subway to
read and write their email on their cell phone.

Only some cell phones work with UTF-8, most don't. And I was often
told by friends that my email program (I normally use UTF-8)
"doesn't work correctly" :)  More often I just didn't get any
answer at all...

Based on this experience it is only natural that people don't
switch to UTF-8.  And even if more and more of the newer programs
also work with UTF-8, it will probably still take a while until
this "tradition" in the Japanese software developer community
changes.

Continuing with the answers:

- I don't like UTF-8.  It is too new, everybody is used to JIS and
  when using UTF-8 there are always lots of problems.
  With JIS things work well out of the box.

- The file size becomes bigger, as there are so many different characters
  which have to be encoded, and Japanese characters are encoded with
  three bytes in UTF-8.

- There are too many different versions of UTF-8, which creates problems.
  There is only one version of JIS, and therefore no version
  problems arise.

- I only use UTF-8 if absolutely necessary, for example when Chinese
  and Japanese texts are on the same page.

- when using UTF-8 the characters do not look nice.

- There is no need for UTF-8: Japanese and Ascii is all we need in
  normal circumstances, why bother about other languages?

- Similar Characters are grouped together
  and differences between similar Japanese Characters get lost.

- Doesn't look good.

- When it is only Chinese or only Japanese it looks good; when mixing
  languages the page gets ugly.
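
On the file-size answer above, the difference is easy to measure. The
following is a minimal sketch (the sample string and variable names are
chosen for illustration only, and it assumes the mbstring extension is
loaded) that counts the bytes one three-character Japanese string needs
in each encoding:

```php
<?php
// Sample text (chosen for illustration): three kanji, saved as UTF-8.
$text = "日本語";

$sizes = array();
foreach (array("UTF-8", "SJIS", "EUC-JP", "ISO-2022-JP") as $encoding) {
    // strlen() counts bytes, not characters
    $sizes[$encoding] = strlen(mb_convert_encoding($text, $encoding, "UTF-8"));
}

foreach ($sizes as $encoding => $bytes) {
    printf("%-12s %2d bytes\n", $encoding, $bytes);
}
```

UTF-8 needs three bytes per kanji where SJIS and EUC-JP need two.
ISO-2022-JP also uses two bytes per kanji but adds escape sequences to
switch into and out of the Japanese character set, so for a very short
string it can even come out longer than UTF-8.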

> <minor rant>
> Later in the thread Tomas suggested using UTF-8 instead of ISO-2022-JP,
> and getting Docomo to change. The problem is all those handsets in
> existence. Not to mention all the other legacy email clients that don't
> work well with UTF-8, but real people still use. Docomo could convert
> from UTF-8 to ISO-2022-JP at the gateway of course, which apparently is
> what softbank and kddi actually do, but Docomo deal with a lot of email
> so care about the cost of the extra CPU cycles, and you're going to need
> better motivation for them than "PHP cannot write proper MIME headers" I
> suspect.
> </minor rant>

Yes, I agree.

> Darren
> 
> As you obviously have the mb_string extension installed, have you tried
> using mb_send_mail() instead of mail()? Then you shouldn't need to mess
> around encoding your own mimeheaders.
 
I have to admit that I do not remember whether I tried that.

Next time I will try again :)
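
For anyone who wants to try Darren's suggestion, a minimal sketch of
what it could look like follows. The wrapper name
`send_japanese_mail()' is made up for illustration, the arguments are
assumed to be UTF-8, and the sketch is untested against the mail
programs mentioned in this thread:

```php
<?php
// Hypothetical wrapper around mb_send_mail(); assumes the script and its
// string arguments are UTF-8 and that the mbstring extension is loaded.
function send_japanese_mail($to, $subject, $body, $fromName, $fromAddress) {

    // tell mbstring which encoding the arguments are in
    mb_internal_encoding("UTF-8");

    // "Japanese" selects ISO-2022-JP with base64 header encoding
    mb_language("Japanese");

    // mb_send_mail() encodes subject and body itself, but not the
    // additional headers, so the display name is encoded by hand
    $from = sprintf("%s <%s>",
                    mb_encode_mimeheader($fromName, "ISO-2022-JP", "B"),
                    $fromAddress);

    return mb_send_mail($to, $subject, $body, "From: " . $from);
}
```

A call would then look like
`send_japanese_mail("someone@example.com", "件名", "本文", "送信者",
"me@example.com")', with placeholder addresses.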

> P.S. If still no luck, and you want to try writing your own solution in
> PHP, I seem to have a function called jis_loop() in mail.inc in my
> open-source fclib ( http://dcook.org/software/fclib/ ) that does this.
> It is 5 years since I touched that file, and probably 7-8 years since I
> wrote that function, and just looking at it now I cannot make head nor
> tail of it. So I'd regard that as a last resort :-)

Thanks for your help :)

Dietrich


Here is the code I finally used - it is kind of ugly and I currently
don't have the time to write a nicer version, sorry:

------------------------------
<?php // (emacs: -*- mode: php -*-) かな漢字 - save in UTF-8

  /**
     Encode and send Japanese emails
     using ISO-2022-JP and MIME header encoding.

     In my tests, emails whose header fields were encoded with the
     `mb_encode_mimeheader()' function could not be decoded correctly.

     Long subjects and strings in other header fields (for example
     the 'From:' or 'Reply-To:' values) are broken down into shorter
     strings by `mb_encode_mimeheader()', and the result cannot be
     decoded by (at least some) email programs.

     These problems were encountered when using evolution 2.12.3
     as email reader, but users of other email programs reported the
     same errors when trying to decode emails encoded with
     `mb_encode_mimeheader()'.

     The code in this file uses several "work-arounds" found by
     experimenting with several encoding strategies.  Emails formatted
     with `send_email_iso_2022_JP()' could be decoded correctly by
     evolution 2.12.3 and other email programs.
  */


  /**
     Encode `$string' by:
     - breaking it into chunks of at most 10 characters
     - base64 encoding each chunk
     - putting every base64 encoded chunk between the prefix
       `=?ISO-2022-JP?B?' and the postfix `?='
     - assembling the encoded chunks separated with `$separator' into
       one result string
  */
function encode_mimeheader_iso_2022_JP($string, $separator) {

    // Notes:
    //
    // - The encoding seems to work only with "ISO-2022-JP" as the charset
    //   name in the MIME prefix; when using "JIS", at least my email
    //   reader (evolution 2.12.3) is not able to decode the result.
    //
    // - The separator has to be given as the second argument:
    //   - a blank has to be used when encoding a name in the 'From:',
    //     'Reply-To:' etc. fields
    //   - a line break is the usual separator for the subject; note that
    //     a folded header line must continue with a space or tab, so
    //     "\r\n " (CRLF plus a blank) is the form RFC 2047 describes
    //   I didn't verify
    //   - if a blank also works for the subject field (probably it does)
    //   - how / if other email programs work with the following code
    //     (I tested only with evolution 2.12.3)
    
    // convert `$string' to ISO-2022-JP (JIS)
    $encoding  = "ISO-2022-JP";
    $stringJIS = mb_convert_encoding($string, $encoding, "AUTO");

    // encode `$stringJIS'
    // - subdividing `$stringJIS' into character chunks of length `$chunk_length'
    // - encoding every chunk with base64
    // - putting the result between "=?ISO-2022-JP?B?" and "?="
    // - using `$separator' to separate the base64 encoded chunks
    $chunk_length = 10;
    $chunks = array();

    // pass `$encoding' explicitly: without it mb_strlen() counts in the
    // internal encoding and miscounts the ISO-2022-JP escape sequences
    $length = mb_strlen($stringJIS, $encoding);
    for ($offset = 0; $offset < $length; $offset += $chunk_length) {

        // encode the next `$chunk_length' chars
        $chunk    = mb_substr($stringJIS, $offset, $chunk_length, $encoding);
        $chunks[] = sprintf("=?%s?B?%s?=", $encoding, base64_encode($chunk));
    }

    // return the encoded chunks joined with `$separator'
    // (no trailing separator after the last chunk)
    return implode($separator, $chunks);
}

/**
   Encode the email body using ISO-2022-JP (JIS).

   Note that the coding system has to be specified in the
   Content-Type/charset header:
   Content-Type: text/plain; charset=ISO-2022-JP
*/
function encode_body_iso_2022_JP($body) {

    // encode body using JIS (ISO-2022-JP)
    return mb_convert_encoding($body, "ISO-2022-JP", "AUTO");
}

/**
   Format a Japanese email using ISO-2022-JP (JIS) and mime header
   encoding.
*/
function send_email_iso_2022_JP($recipientEmailAddress, $subject, $body,
                                $senderName, $senderEmailAddress) {

    // set current language to Japanese
    mb_language("ja");

    // encode subject; the blank after the CRLF marks the next line as a
    // continuation of the 'Subject:' header
    $subjectMIME = encode_mimeheader_iso_2022_JP($subject, "\r\n ");

    // encode body
    $bodyJIS = encode_body_iso_2022_JP($body);

    // formatting the sender string
    if ($senderName && strlen($senderName) > 0) {

        // encode the name of the sender
        $senderNameMIME = encode_mimeheader_iso_2022_JP($senderName, " ");

        // format email address
        $senderMIME = sprintf("%s <%s>", $senderNameMIME, $senderEmailAddress);

    } else {

        // format email address
        $senderMIME = sprintf("%s", $senderEmailAddress);
    }

    // formatting the mime header
    $headers  = "MIME-Version: 1.0\r\n";
    $headers .= sprintf("From: %s\r\n",     $senderMIME);
    $headers .= sprintf("Reply-To: %s\r\n", $senderMIME);
    $headers .= "Content-Type: text/plain; charset=ISO-2022-JP\r\n";

    // send encoded mail
    $result = mail($recipientEmailAddress, $subjectMIME, $bodyJIS, $headers);

    // return result
    return $result;
}

?>

------------------------------
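
One way to check chunked headers like the ones produced above without
sending any mail is to decode them again with `mb_decode_mimeheader()'.
The sketch below is self-contained and makes a few assumptions of its
own: the helper name `encode_words_iso_2022_jp()' is hypothetical, the
input is taken to be UTF-8, and the encoded words are joined with CRLF
plus a blank, which is the folding RFC 2047 describes.

```php
<?php
// mb_decode_mimeheader() returns its result in the internal encoding
mb_internal_encoding("UTF-8");

// Hypothetical helper: split a UTF-8 string into chunks of `$chunkLength'
// characters and wrap each chunk as a base64 "encoded-word".
function encode_words_iso_2022_jp($string, $chunkLength = 10) {
    $words  = array();
    $length = mb_strlen($string, "UTF-8");
    for ($offset = 0; $offset < $length; $offset += $chunkLength) {
        // cut in UTF-8 first, so every chunk is converted on its own and
        // carries its own ISO-2022-JP escape sequences
        $chunk    = mb_substr($string, $offset, $chunkLength, "UTF-8");
        $chunkJIS = mb_convert_encoding($chunk, "ISO-2022-JP", "UTF-8");
        $words[]  = "=?ISO-2022-JP?B?" . base64_encode($chunkJIS) . "?=";
    }
    // CRLF followed by a blank folds the header onto a new line
    return implode("\r\n ", $words);
}

// Sample subject (chosen for illustration)
$subject = "これは日本語の長い件名のテストです。文字化けしないことを確認します。";
$encoded = encode_words_iso_2022_jp($subject);
$decoded = mb_decode_mimeheader($encoded);

echo $encoded, "\n";
echo ($decoded === $subject ? "round trip ok" : "round trip FAILED"), "\n";
```

With ten characters per chunk, every encoded word stays well below the
75-character limit that RFC 2047 sets for a single encoded word. A
successful round trip here only shows that mbstring can read its own
output; it says nothing about how a particular mail program will behave.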



--
PHP Unicode & I18N Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
