php-i18n Digest 10 May 2008 06:35:15 -0000 Issue 391

Topics (messages 1175 through 1182):

Re: proposal: unification of the grapheme_extract functions
        1175 by: Ed Batutis
        1176 by: Stanislav Malyshev
        1177 by: Texin, Tex
        1178 by: Ed Batutis

ubuntu 7.10 pecl install intl
        1179 by: Darren Cook

Re: Problems with mime encoding of Japanese Characters in Subject
        1180 by: Darren Cook
        1181 by: Dietrich Bollmann

Re: intl extension
        1182 by: Gergely Hodicska


----------------------------------------------------------------------
--- Begin Message ---
> > I am proposing to unify the three grapheme_extract functions this way:
> > ...

The change is checked in.

=Ed



--- End Message ---
--- Begin Message ---
Hi!

> The change is checked in.

Thanks!
--
Stanislav Malyshev, Zend Software Architect
[EMAIL PROTECTED]   http://www.zend.com/
(408)253-8829   MSN: [EMAIL PROTECTED]

--- End Message ---
--- Begin Message ---
Ed,

If I use GRAPHEME_EXTR_MAXBYTES, does it return exactly that number of bytes, or
the maximum number of whole Unicode characters or whole graphemes that
can be extracted without exceeding the max bytes?

I assume it is the max # of whole graphemes that do not exceed the max bytes.
The use case is a limited storage area for Unicode text, such as a
filename that is limited to 8 bytes, or mail with a subject heading limited to
60 (pick a number) bytes. The result needs to be a proper Unicode
string that displays meaningfully. So if I extract from a larger field, I want
the max number of graphemes that fit in my storage space.

Is that what it does?

Also, is the $start value in byte, character, or grapheme units for each of
the types?

tex


> -----Original Message-----
> From: Ed Batutis [mailto:[EMAIL PROTECTED] 
> Sent: Friday, May 02, 2008 2:49 PM
> To: [EMAIL PROTECTED]
> Subject: [PHP-I18N] proposal: unification of the 
> grapheme_extract functions
> 
> Hi,
> 
> I am proposing to unify the three grapheme_extract functions this way:
> 
> string grapheme_extract  ( string $haystack  ,
>                            int $size
>                            [, int $extract_type
>                            [, int $start  ]] )
> 
> where $extract_type is:
> 
> GRAPHEME_EXTR_COUNT - $size is number of graphemes (default)
> GRAPHEME_EXTR_MAXBYTES - $size is maximum number of bytes to extract
> GRAPHEME_EXTR_MAXCHARS - $size is maximum number of UTF-8 characters to extract
> 
> and the other arguments are as in the current set of extract 
> functions.
> 
> Sorry if I missed someone's proposal for this - I am only on 
> the php-i18n list at this point. Please post your proposal to 
> this list, if possible.
> 
> Thanks,
> 
> =Ed
> 
> 
> 
> --
> PHP Unicode & I18N Mailing List (http://www.php.net/) To 
> unsubscribe, visit: http://www.php.net/unsub.php
> 
> 

--- End Message ---
--- Begin Message ---
>  If I use GRAPHEME_EXTR_MAXBYTES, does it return ...

> I assume it is the max # of whole graphemes that do not exceed the max
> bytes.

Yes. It works just like the old grapheme_extractb.

> Also, the $start value is that in byte, character or grapheme units for
> each of the types?

The start value is always in bytes. I was unsure whether this really made
sense, but it is consistent (and easy to implement).

=Ed
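
To illustrate the two answers above, here is a minimal sketch, assuming the
intl extension is loaded. The sample string is made up for illustration, and
the argument called $start in the proposal is passed as the fourth argument.

```php
<?php
// Sketch only; requires the intl extension. The sample string is made up.
// "a" + U+0301 (combining acute accent) is one grapheme stored in 3 bytes,
// followed by the one-byte graphemes "b" and "c".
$text = "a\xCC\x81bc";

// At most 4 bytes: the accented "a" (3 bytes) and "b" fit, "c" does not.
$four = grapheme_extract($text, 4, GRAPHEME_EXTR_MAXBYTES);

// At most 3 bytes: only the accented "a" fits, and it is never split.
$three = grapheme_extract($text, 3, GRAPHEME_EXTR_MAXBYTES);

// The start value is a byte offset: byte 3 is where "b" begins.
$fromB = grapheme_extract($text, 1, GRAPHEME_EXTR_COUNT, 3);

echo bin2hex($four), "\n";   // 61cc8162
echo bin2hex($three), "\n";  // 61cc81
echo $fromB, "\n";           // b
```

This matches the filename use case: pass the storage limit as $size and the
result is the longest run of whole graphemes that fits.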



--- End Message ---
--- Begin Message ---
Hi,
Has anyone managed to install the "intl" PECL extension on Ubuntu?

On my first try phpize wasn't found. So I installed php5-dev package,
which solved that.

Next I get the mysterious question in [1]. I think 'all' needs an
explanation that this is what you type if you want to specify the path?

I tried just pressing Enter and got lots of output ending in:
 ERROR: `/tmp/pear/temp/intl/configure --with-icu-dir=DEFAULT' failed

I have icu 3.6 installed (and icu-dev package), and "whereis icu" tells me:
   icu: /usr/lib/icu /usr/share/icu

So, I tried answering "all" (without the quotes). Then I gave
/usr/lib/icu, but same problem. Full output is in [2] below.
Giving /usr/share/icu gave the same error.

If someone can suggest what package I am missing, or what I am doing
wrong, I'd be very grateful!

Darren



[1]:
 75 source files, building
running: phpize
Configuring for:
PHP Api Version:         20041225
Zend Module Api No:      20060613
Zend Extension Api No:   220060519
 1. Specify where ICU libraries and headers can be found : DEFAULT

1-1, 'all', 'abort', or Enter to continue:


[2]:

# pecl install intl
downloading intl-1.0.0beta.tgz ...
Starting to download intl-1.0.0beta.tgz (96,707 bytes)
.....................done: 96,707 bytes
75 source files, building
running: phpize
Configuring for:
PHP Api Version:         20041225
Zend Module Api No:      20060613
Zend Extension Api No:   220060519
 1. Specify where ICU libraries and headers can be found : DEFAULT

1-1, 'all', 'abort', or Enter to continue: all
Specify where ICU libraries and headers can be found [DEFAULT] :
/usr/lib/icu/
 1. Specify where ICU libraries and headers can be found : /usr/lib/icu/

1-1, 'all', 'abort', or Enter to continue:
building in /var/tmp/pear-build-root/intl-1.0.0beta
running: /tmp/pear/temp/intl/configure --with-icu-dir=/usr/lib/icu/
checking for grep that handles long lines and -e... /bin/grep
checking for egrep... /bin/grep -E
checking for a sed that does not truncate output... /bin/sed
checking for gcc... gcc
checking for C compiler default output file name... a.out
checking whether the C compiler works... yes
checking whether we are cross compiling... no
checking for suffix of executables...
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether gcc accepts -g... yes
checking for gcc option to accept ISO C89... none needed
checking whether gcc and cc understand -c and -o together... yes
checking if compiler supports -R... no
checking if compiler supports -Wl,-rpath,... yes
checking build system type... i686-pc-linux-gnu
checking host system type... i686-pc-linux-gnu
checking target system type... i686-pc-linux-gnu
checking for PHP prefix... /usr
checking for PHP includes... -I/usr/include/php5
-I/usr/include/php5/main -I/usr/include/php5/TSRM
-I/usr/include/php5/Zend -I/usr/include/php5/ext
-I/usr/include/php5/ext/date/lib -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64
checking for PHP extension directory... /usr/lib/php5/20060613+lfs
checking for PHP installed headers prefix... /usr/include/php5
checking for re2c... no
configure: WARNING: You will need re2c 0.9.11 or later if you want to
regenerate PHP parsers.
checking for gawk... no
checking for nawk... nawk
checking if nawk is broken... no
checking whether to enable internationalization support... yes, shared
/tmp/pear/temp/intl/configure: line 3838: syntax error near unexpected
token `INTL_SHARED_LIBADD'
/tmp/pear/temp/intl/configure: line 3838: `
PHP_SETUP_ICU(INTL_SHARED_LIBADD)'
ERROR: `/tmp/pear/temp/intl/configure --with-icu-dir=/usr/lib/icu/' failed



-- 
Darren Cook
http://dcook.org/mlsn/ (English-Japanese-German-Chinese free dictionary)
http://dcook.org/work/ (About me and my work)
http://dcook.org/work/charts/  (My flash charting demos)

--- End Message ---
--- Begin Message ---
(This is a reply to a problem in the archives, from March:
   http://marc.info/?l=php-i18n&m=120595161128203&w=2 )

As you obviously have the mbstring extension installed, have you tried
using mb_send_mail() instead of mail()? Then you shouldn't need to mess
around encoding your own MIME headers.
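
A minimal sketch of that suggestion, assuming the mbstring extension and a
script saved as UTF-8. The function name send_japanese_mail() and the sample
subject are made up for illustration; nothing is sent unless the function is
called.

```php
<?php
// Sketch only; requires the mbstring extension. Names and texts here are
// placeholders, not taken from the thread.
mb_language("Japanese");        // mail charset ISO-2022-JP, Base64 headers
mb_internal_encoding("UTF-8");  // encoding of the strings in this script

function send_japanese_mail($to, $subject, $body, $from)
{
    // mb_send_mail() converts subject and body to the mail charset and
    // MIME-encodes the subject itself.
    return mb_send_mail($to, $subject, $body, "From: " . $from);
}

// A comparable Subject encoding can be inspected without sending anything:
$subject = "日本語の件名";
$encoded = mb_encode_mimeheader($subject, "ISO-2022-JP", "B");
echo $encoded, "\n";
echo mb_decode_mimeheader($encoded), "\n"; // decoded back to UTF-8
```

Whether a given mail client decodes the result correctly still has to be
tested against that client.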

<minor rant>
Later in the thread Tomas suggested using UTF-8 instead of ISO-2022-JP,
and getting Docomo to change. The problem is all those handsets in
existence. Not to mention all the other legacy email clients that don't
work well with UTF-8, but real people still use. Docomo could convert
from UTF-8 to ISO-2022-JP at the gateway of course, which apparently is
what SoftBank and KDDI actually do, but Docomo deals with a lot of email
and so cares about the cost of the extra CPU cycles, and you're going to
need better motivation for them than "PHP cannot write proper MIME
headers", I suspect.
</minor rant>

Darren

P.S. If still no luck, and you want to try writing your own solution in
PHP, I seem to have a function called jis_loop() in mail.inc in my
open-source fclib ( http://dcook.org/software/fclib/ ) that does this.
It is 5 years since I touched that file, and probably 7-8 years since I
wrote that function, and just looking at it now I cannot make head nor
tail of it. So I'd regard that as a last resort :-)


-- 
Darren Cook
http://dcook.org/mlsn/ (English-Japanese-German-Chinese free dictionary)
http://dcook.org/work/ (About me and my work)
http://dcook.org/work/charts/  (My flash charting demos)

--- End Message ---
--- Begin Message ---
On Fri, 2008-05-09 at 12:56 +0900, Darren Cook wrote: 
> (This is a reply to a problem in the archives, from March:
>    http://marc.info/?l=php-i18n&m=120595161128203&w=2 )

Hi Darren,

Thanks for your answer!

I actually planned to write a more detailed answer - but then couldn't
find the time to finish it and finally lost everything I had prepared 
because of a HD problem :(

I often get email with mojibake (scrambled subjects etc.) here in Japan
and, after looking at other people's code, had the impression that
most people deal with this kind of problem by trial and error
until their personal problem is solved, rather than by reading the official
coding standards.  The code used by ec-cube, for example, seems to suffer
from the same problem I encountered.

The result is buggy email programs everywhere, which often do not work
with anything other than JIS, SJIS and EUC.  And the only way to deal
with their "personal" approach to character coding therefore seems to
be to fall back on the same strategy: trial and error.

Not very satisfying. But in the end I did the same thing, as this seemed
to be the only way to make my problem vanish.  (See my hackish code
appended.)

The answers I got from other people when asking them why
they do not use UTF-8 probably reflect this kind of experience, and the
beliefs caused by it, rather than the facts about encoding methods
and standards...

Here is a list of some of the answers I got (as I got them):

- Most cell phones don't work with UTF-8.

...this is the most important answer I got, as lots of
people in Japan use the time they spend in the subway to
read and write their email on their cell phone.

Only some cell phones work with UTF-8, most don't. And I was often
told by friends that my email program (I normally use UTF-8)
"doesn't work correctly" :)  More often I just didn't get any
answer at all...

Based on this experience it is only natural that people don't
switch to UTF-8.  And even if more and more of the newer programs
also work with UTF-8, it will probably still take a while until
this "tradition" in the Japanese software developer community
changes.

continuing with the answers:

- I don't like UTF-8.  It is too new, everybody is used to JIS and
  when using UTF-8 there are always lots of problems.
  With JIS things work well out of the box.

- The file size becomes bigger as there are so many different characters
  which have to be encoded, and Japanese characters are encoded with
  three bytes in UTF-8.

- There are too many different versions of UTF-8, which creates problems.
  There is only one version of JIS and therefore no version
  problems arise.

- I only use UTF-8 if absolutely necessary, for example when Chinese
  and Japanese texts are on the same page.

- When using UTF-8 the characters do not look nice.

- There is no need for UTF-8: Japanese and ASCII is all we need in
  normal circumstances, why bother about other languages?

- Similar characters are grouped together
  and differences between similar Japanese characters get lost.

- Doesn't look good.

- When only Chinese or only Japanese is used it looks good; when mixing
  languages the page gets ugly.
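
The file-size answer in the list above is the one that is easy to check.
A minimal sketch, assuming the mbstring extension and a script saved as
UTF-8; the sample text is made up for illustration:

```php
<?php
// Sketch only; requires the mbstring extension. The sample text is made up.
$utf8 = "日本語";  // three kanji, source file saved as UTF-8

$sjis = mb_convert_encoding($utf8, "SJIS", "UTF-8");
$jis  = mb_convert_encoding($utf8, "ISO-2022-JP", "UTF-8");

// strlen() counts bytes, not characters.
printf("UTF-8:       %d bytes\n", strlen($utf8)); // 9: three bytes per kanji
printf("Shift_JIS:   %d bytes\n", strlen($sjis)); // 6: two bytes per kanji
printf("ISO-2022-JP: %d bytes\n", strlen($jis));  // 12: 6 + two 3-byte escapes
```

So the three-bytes-per-character part holds for kanji, although for a short
string like this one ISO-2022-JP is larger still because of its escape
sequences.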

> <minor rant>
> Later in the thread Tomas suggested using UTF-8 instead of ISO-2022-JP,
> and getting Docomo to change. The problem is all those handsets in
> existence. Not to mention all the other legacy email clients that don't
> work well with UTF-8, but real people still use. Docomo could convert
> from UTF-8 to ISO-2022-JP at the gateway of course, which apparently is
> what softbank and kddi actually do, but Docomo deal with a lot of email
> so care about the cost of the extra CPU cycles, and you're going to need
> better motivation for them than "PHP cannot write proper MIME headers" I
> suspect.
> </minor rant>

Yes, I agree.

> Darren
> 
> As you obviously have the mb_string extension installed, have you tried
> using mb_send_mail() instead of mail()? Then you shouldn't need to mess
> around encoding your own mimeheaders.
 
I have to admit I do not remember well.

The next time I will try again :)

> P.S. If still no luck, and you want to try writing your own solution in
> PHP, I seem to have a function called jis_loop() in mail.inc in my
> open-source fclib ( http://dcook.org/software/fclib/ ) that does this.
> It is 5 years since I touched that file, and probably 7-8 years since I
> wrote that function, and just looking at it now I cannot make head nor
> tail of it. So I'd regard that as a last resort :-)

Thanks for your help :)

Dietrich


Here is the code I finally used - it is kind of ugly and I currently
don't have the time to write a nicer version, sorry:

------------------------------
<?php // (emacs: -*- mode: php -*-) かな漢字 - save in UTF-8

  /**
     Encode and Send Japanese emails
     using ISO-2022-JP and mime header encoding.
   
     Emails with header fields encoded with the
     `mb_encode_mimeheader()' function cannot be decoded correctly.

     Long subjects and strings in other header fields (for example
     'From:' or the 'Reply-To:' values) are broken down into shorter
     strings by `mb_encode_mimeheader()' which cannot be decoded by
     (at least some) email programs.

     These problems have been encountered when using evolution 2.12.3
     as email reader, but users of other email programs reported the
     same errors when trying to decode emails encoded with
     `mb_encode_mimeheader()'.

     The code in this file uses several "work-arounds" found by
     experimenting with several coding strategies.  Emails formatted
     with `send_email_iso_2022_JP()' could be decoded correctly by
     evolution 2.12.3 and other email programs.
  */


  /**
     Encode `$string' by:
     - breaking it into chunks of at most 10 characters
     - base64 encoding each chunk
     - putting every base64 encoded chunk between the prefix
       `=?ISO-2022-JP?B?' and the postfix `?='
     - joining the encoded chunks with `$separator' into one result string
  */
function encode_mimeheader_iso_2022_JP($string, $separator) {

    // Notes:
    //
    // - The encoding seems to work only with "ISO-2022-JP" as the charset
    //   name in the mime prefix; when using "JIS" at least my email reader
    //   (evolution 2.12.3) is not able to decode the result.
    //
    // - The separator has to be given as second argument:
    //   - a blank has to be used when encoding a name in the 'From:',
    //     'Reply-To:' etc.
    //   - a line break is used when encoding the subject (header folding
    //     expects the line break to be followed by a blank).
    //   I didn't verify
    //   - if a blank works also for the subject field (probably it does)
    //   - how / if other email programs work with the following code
    //     (I tested only with evolution 2.12.3)

    // convert `$string' to ISO-2022-JP (JIS)
    $encoding  = "ISO-2022-JP";
    $stringJIS = mb_convert_encoding($string, $encoding, "AUTO");

    // encode `$stringJIS'
    // - subdividing `$stringJIS' into character chunks of length `$chunk_length'
    // - encoding every chunk with base64
    // - putting the result between "=?ISO-2022-JP?B?" and "?="
    // - using `$separator' to separate the base64 encoded chunks
    $chunk_length = 10;
    $chunks = array();
    // pass `$encoding' explicitly so the length is counted in JIS characters
    // and not in the internal encoding
    while ($length = mb_strlen($stringJIS, $encoding)) {

        // encode the next `$chunk_length' chars
        $chunk           = mb_substr($stringJIS, 0, $chunk_length, $encoding);
        $chunk_64encoded = base64_encode($chunk);
        $chunks[] = sprintf("=?%s?B?%s?=", $encoding, $chunk_64encoded);

        // continue with the rest of the string
        $stringJIS = mb_substr($stringJIS, $chunk_length, $length, $encoding);
    }

    // join the chunks; no separator is left dangling after the last one
    return implode($separator, $chunks);
}

  /**
     Encode the email body using ISO-2022-JP (JIS).

     Note that the coding system has to be specified in the
     Content-Type/charset header:
     Content-Type: text/plain; charset=ISO-2022-JP
  */
function encode_body_iso_2022_JP($body) {

    // encode body using JIS (ISO-2022-JP)
    return mb_convert_encoding($body, "ISO-2022-JP", "AUTO");
}

/**
   Format and send a Japanese email using ISO-2022-JP (JIS) and mime header
   encoding.
*/
function send_email_iso_2022_JP($recipientEmailAddress, $subject, $body,
                                $senderName, $senderEmailAddress) {

    // set current language to Japanese
    mb_language("ja");

    // encode subject; a folded header line must start with a blank,
    // so the line break is followed by a space
    $subjectMIME = encode_mimeheader_iso_2022_JP($subject, "\r\n ");

    // encode body
    $bodyJIS = encode_body_iso_2022_JP($body);

    // formatting the sender string
    if ($senderName && strlen($senderName) > 0) {

        // encode the name of the sender
        $senderNameMIME = encode_mimeheader_iso_2022_JP($senderName, " ");

        // format email address
        $senderMIME = sprintf("%s <%s>", $senderNameMIME, $senderEmailAddress);

    } else {

        // format email address
        $senderMIME = sprintf("%s", $senderEmailAddress);
    }

    // formatting the mime header
    $headers  = "MIME-Version: 1.0\r\n" ;
    $headers .= sprintf("From: %s\r\n",     $senderMIME);
    $headers .= sprintf("Reply-To: %s\r\n", $senderMIME);
    $headers .= "Content-Type: text/plain; charset=ISO-2022-JP\r\n";
   
    // send encoded mail
    $result = mail($recipientEmailAddress, $subjectMIME, $bodyJIS, $headers);

    // return result
    return $result;
}

?>

------------------------------



--- End Message ---
--- Begin Message ---
Hi!


> I have released a beta version of the intl extension (ICU implementation).
> It is documented here:
> http://docs.php.net/manual/en/book.intl.php

I just started to get familiar with intl when I read your article in
php|architect, and I like a lot of the features of this extension.

I have some questions:
- If I have to launch a site with multi-language support in
1-2 months, is it worth relying on this extension? (Will it be part of 5.3?)
- Is there now a standard-like way to handle plurals?

TIA!


Best Regards,
Felhő


--- End Message ---
