php-i18n Digest 3 Feb 2010 07:32:27 -0000 Issue 437
Topics (messages 1354 through 1369):

Re: adding GB18030 support for mbstring
    1354 by: Tex Texin
    1362 by: KITAZAKI Shigeru
    1363 by: KITAZAKI Shigeru
    1364 by: Tex Texin
    1365 by: Moriyoshi Koizumi
    1366 by: Moriyoshi Koizumi
    1367 by: Moriyoshi Koizumi
    1368 by: Tex Texin
    1369 by: Moriyoshi Koizumi

Re: [PHP-DEV] RE: [PHP-I18N] adding GB18030 support for mbstring
    1355 by: Pierre Joye
    1356 by: Stanislav Malyshev
    1357 by: Tex Texin
    1358 by: Pierre Joye
    1359 by: Stanislav Malyshev
    1360 by: Tex Texin
    1361 by: Moriyoshi Koizumi
----------------------------------------------------------------------
--- Begin Message ---
Since ICU supports many conversions, including gb18030, and is regularly
updated and is already a part of php, it makes no sense to include
individually written conversions. ICU also gets considerable testing and
review.
Conversions should all be driven through icu.
-----Original Message-----
From: Moriyoshi Koizumi [mailto:[email protected]]
Sent: Sunday, January 31, 2010 11:29 PM
To: KITAZAKI Shigeru
Cc: [email protected]; [email protected]
Subject: Re: [PHP-I18N] adding GB18030 support for mbstring
Kitazaki-san,
First thank you for your effort. But, I am under the impression that
the conversion table looks too huge to include in a distribution
(>30MB). Is there any way to get this more compressed?
BTW, I created an extension that is near-compatible with mbstring and
based on ICU that of course supports GB18030. See
http://github.com/moriyoshi/mbstring-ng for detail.
Regards,
Moriyoshi
2010/1/28 KITAZAKI Shigeru <[email protected]>:
> We made a patch to add a mbfilter for GB18030 encoding for PHP-5.3.1.
> Please take a look at our blog:
> http://developer.cybozu.co.jp/oss/2010/01/php-mbstring-pa.html
>
> We would appreciate if you take this patch into the mainline.
>
> BTW, our blog has various other patches for PHP in addition to this one.
> Feel free to mail me if you are interested in some of them.
>
> Regards,
> KITAZAKI Shigeru <[email protected]>
>
> --
> PHP Unicode & I18N Mailing List (http://www.php.net/)
> To unsubscribe, visit: http://www.php.net/unsub.php
>
>
--
PHP Unicode & I18N Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
--- End Message ---
--- Begin Message ---
hi,
Moriyoshi Koizumi wrote:
> First thank you for your effort. But, I am under the impression that
> the conversion table looks too huge to include in a distribution
> (>30MB). Is there any way to get this more compressed?
>
> BTW, I created an extension that is near-compatible with mbstring and
> based on ICU that of course supports GB18030. See
> http://github.com/moriyoshi/mbstring-ng for detail.
I worry about it, too, because GB18030 covers a very large number of
characters. If the corresponding code points between Unicode and GB18030
could be calculated at run time, we could probably shrink the tables.
However, I have not yet figured out a way to avoid the conversion table.
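For what it is worth, part of the mapping is algorithmic by definition:
GB18030's four-byte codes for the supplementary planes (U+10000 through
U+10FFFF) run linearly from 0x90308130, so that range needs no table at
all. The sketch below shows the arithmetic in Python rather than in the C
of the patch, and checks it against Python's built-in gb18030 codec. The
helper name is hypothetical, and the BMP portion of the standard still
needs tables or range lists.

```python
# Sketch: GB18030 four-byte encoding of supplementary-plane code points,
# computed arithmetically instead of via a lookup table.
# encode_supplementary() is a hypothetical helper name, not part of the patch.

# Linear index of the four-byte sequence 0x90 0x30 0x81 0x30 (U+10000).
SUPPLEMENTARY_BASE = (0x90 - 0x81) * 12600


def encode_supplementary(code_point: int) -> bytes:
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF are handled here")
    index = code_point - 0x10000 + SUPPLEMENTARY_BASE
    # Mixed-radix digits: bytes 1 and 3 span 126 values (0x81..0xFE),
    # bytes 2 and 4 span 10 values (0x30..0x39).
    index, b4 = divmod(index, 10)
    index, b3 = divmod(index, 126)
    b1, b2 = divmod(index, 10)
    return bytes([0x81 + b1, 0x30 + b2, 0x81 + b3, 0x30 + b4])


if __name__ == "__main__":
    for cp in (0x10000, 0x1F600, 0x10FFFF):
        # Cross-check against the table-free path of Python's own codec.
        assert encode_supplementary(cp) == chr(cp).encode("gb18030")
        print(f"U+{cp:04X} -> {encode_supplementary(cp).hex()}")
```
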
Also, thank you for your excellent work on mbstring-ng. Are there any
milestones for including it in the mainline? We hope it will be available
on both MS Windows and POSIX-based systems such as Linux, which use
different config.* macros, different compilers, and so on.
If more people want to use GB18030 with 'mbstring', it would be better to
add the new encoding, although I don't know how many people want it.
At the least, I want to use the GB18030 encoding in the same manner as the
existing ones.
Pierre Joye wrote:
> However it is obvious that the mid/long term goal should be to replace
> it completely with ICU.
I agree with you, of course, in our not so ideal world :)
Thanks,
Shigeru
--- End Message ---
--- Begin Message ---
Koizumi-san
Let me raise one concern about mbstring-ng.
The current mbstring supports 'ISO-2022-JP-MS', which is different from
'ISO-2022-JP'. I suspect the current implementation of ICU cannot convert
between ISO-2022-JP-MS and Unicode correctly.
For example, Japanese hankaku (half-width) katakana: GA, A with a sonant mark.
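The variant problem is easy to reproduce outside PHP and ICU. As a rough
illustration in Python (whose standard codecs do not include
ISO-2022-JP-MS, so 'iso2022_jp_ext' stands in here purely as a second
variant that carries JIS X 0201 katakana), the plain ISO-2022-JP codec
refuses half-width katakana while the extended one round-trips it. The
helper name is hypothetical.

```python
# Sketch: two ISO-2022-JP variants disagree on half-width (hankaku) katakana.
# Python has no ISO-2022-JP-MS codec; iso2022_jp_ext is only a stand-in variant.


def can_round_trip(text: str, encoding: str) -> bool:
    """Return True if text survives encode + decode unchanged in encoding."""
    try:
        return text.encode(encoding).decode(encoding) == text
    except UnicodeError:
        return False


# Half-width KA (U+FF76) followed by the half-width voiced sound mark (U+FF9E).
HANKAKU_GA = "\uff76\uff9e"

if __name__ == "__main__":
    for name in ("iso2022_jp", "iso2022_jp_ext"):
        print(name, can_round_trip(HANKAKU_GA, name))
```

A check like this, run against each candidate converter, is one way to
find out which variant a given library actually implements.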
Although modifying ICU itself would be the better way, it would take a
long time. What do you think about this?
Moriyoshi Koizumi wrote:
> BTW, I created an extension that is near-compatible with mbstring and
> based on ICU that of course supports GB18030. See
> http://github.com/moriyoshi/mbstring-ng for detail.
>
Regards,
Shigeru
--- End Message ---
--- Begin Message ---
ICU has at least 5 versions of ISO-2022-JP.
http://demo.icu-project.org/icu-bin/convexp
If the one you refer to is not one of these, send me the details and I'll
log it with the ICU team.
tex
-----Original Message-----
From: KITAZAKI Shigeru [mailto:[email protected]]
Sent: Tuesday, February 02, 2010 4:43 AM
To: Moriyoshi Koizumi
Cc: [email protected]
Subject: Re: [PHP-I18N] adding GB18030 support for mbstring
Koizumi-san
Let me tell you the one concern about mbstring-ng.
The current mbstring supports 'ISO-2022-JP-MS', this is different from
'ISO-2022-JP'. And the current implementation of ICU can not convert
between ISO-2022-JP-MS and unicode correctly, I guess.
For example, Japanese hankaku katakana, GA, A with a sonant mark.
Although it's better way to modify ICU itself, it takes long time.
How do you think of this?
Moriyoshi Koizumi wrote:
> BTW, I created an extension that is near-compatible with mbstring and
> based on ICU that of course supports GB18030. See
> http://github.com/moriyoshi/mbstring-ng for detail.
>
Regards,
Shigeru
--- End Message ---
--- Begin Message ---
While that is one of the concerns, I don't think having multiple
encoding conversion libraries that each require a huge RSS makes much
sense. Looking for ways to extend ICU itself should be worth a try.
Regards,
Moriyoshi
2010/2/2 KITAZAKI Shigeru <[email protected]>:
> Koizumi-san
>
> Let me tell you the one concern about mbstring-ng.
> The current mbstring supports 'ISO-2022-JP-MS', this is different from
> 'ISO-2022-JP'. And the current implementation of ICU can not convert
> between ISO-2022-JP-MS and unicode correctly, I guess.
> For example, Japanese hankaku katakana, GA, A with a sonant mark.
>
> Although it's better way to modify ICU itself, it takes long time.
> How do you think of this?
>
> Moriyoshi Koizumi wrote:
>> BTW, I created an extension that is near-compatible with mbstring and
>> based on ICU that of course supports GB18030. See
>> http://github.com/moriyoshi/mbstring-ng for detail.
>>
>
> Regards,
> Shigeru
>
--- End Message ---
--- Begin Message ---
None of them can handle CP50220.
Moriyoshi
2010/2/3 Tex Texin <[email protected]>:
> icu has at least 5 versions of iso 2022-jp.
>
> http://demo.icu-project.org/icu-bin/convexp
>
> If the one you refer to is not one of these send me the details and I'll log
> it with the icu team.
>
> tex
--- End Message ---
--- Begin Message ---
It just turned out that ISO_2022,locale=ja,version=3 is actually
ISO-2022-JP-MS.
Moriyoshi
On Wed, Feb 3, 2010 at 10:22 AM, Moriyoshi Koizumi <[email protected]> wrote:
> None of them can handle CP50220.
>
> Moriyoshi
>
> 2010/2/3 Tex Texin <[email protected]>:
>> icu has at least 5 versions of iso 2022-jp.
>>
>> http://demo.icu-project.org/icu-bin/convexp
>>
>> If the one you refer to is not one of these send me the details and I'll log
>> it with the icu team.
>>
>> tex
--- End Message ---
--- Begin Message ---
Yes, 50220 is just normal ISO-2022-JP:
http://msdn.microsoft.com/en-us/library/dd317756(VS.85).aspx
-----Original Message-----
From: Moriyoshi Koizumi [mailto:[email protected]]
Sent: Tuesday, February 02, 2010 10:54 PM
To: KITAZAKI Shigeru
Cc: [email protected]
Subject: Re: [PHP-I18N] adding GB18030 support for mbstring
It just turned out ISO_2022,locale=ja,version=3 is actually ISO-2022-JP-MS.
Moriyoshi
On Wed, Feb 3, 2010 at 10:22 AM, Moriyoshi Koizumi <[email protected]> wrote:
> None of them can handle CP50220.
>
> Moriyoshi
>
> 2010/2/3 Tex Texin <[email protected]>:
>> icu has at least 5 versions of iso 2022-jp.
>>
>> http://demo.icu-project.org/icu-bin/convexp
>>
>> If the one you refer to is not one of these send me the details and I'll log
>> it with the icu team.
>>
>> tex
--- End Message ---
--- Begin Message ---
That is not correct. The .NET names listed there are also used internally
in MS products, as are the code pages, and they don't necessarily reflect
the actual codeset defined in the IANA charset registry even if the names
look the same. Look at "additional information" for the differences.
Moriyoshi
On Wed, Feb 3, 2010 at 4:16 PM, Tex Texin <[email protected]> wrote:
> Yes- 50220 is just normal ISO-2022-JP:
> http://msdn.microsoft.com/en-us/library/dd317756(VS.85).aspx
--- End Message ---
--- Begin Message ---
hi,
On Mon, Feb 1, 2010 at 8:59 PM, Tex Texin <[email protected]> wrote:
> Since ICU supports many conversions, including gb18030, and is regularly
> updated and is already a part of php, it makes no sense to include
> individually written conversions. ICU also gets considerable testing and
> review.
>
> Conversions should all be driven through icu.
In an ideal world, yes. But in our not so ideal world mbstring is still
here and is still used in many places inside PHP, so I think it makes
sense to add more encodings if there is a need for them.
However, it is obvious that the mid/long term goal should be to replace
it completely with ICU.
Cheers,
--
Pierre
@pierrejoye | http://blog.thepimp.net | http://www.libgd.org
--- End Message ---
--- Begin Message ---
Hi!
> In an ideal world yes. But in our not so ideal world mbstring is still
> here, is still used in many places inside PHP and I think it makes
> sense too add more encoding if there is a need for them.
Can't we make mbstring use ICU data, so that anybody using the API still
gets the same API, but the list of encodings is everything supported by
ICU?
--
Stanislav Malyshev, Zend Software Architect
[email protected] http://www.zend.com/
(408)253-8829 MSN: [email protected]
--- End Message ---
--- Begin Message ---
mbstring can call out to icu to do the work.
-----Original Message-----
From: Pierre Joye [mailto:[email protected]]
Sent: Monday, February 01, 2010 12:11 PM
To: Tex Texin
Cc: Moriyoshi Koizumi; KITAZAKI Shigeru; [email protected];
[email protected]
Subject: Re: [PHP-DEV] RE: [PHP-I18N] adding GB18030 support for mbstring
hi,
On Mon, Feb 1, 2010 at 8:59 PM, Tex Texin <[email protected]> wrote:
> Since ICU supports many conversions, including gb18030, and is regularly
> updated and is already a part of php, it makes no sense to include
> individually written conversions. ICU also gets considerable testing and
> review.
>
> Conversions should all be driven through icu.
In an ideal world yes. But in our not so ideal world mbstring is still
here, is still used in many places inside PHP and I think it makes
sense too add more encoding if there is a need for them.
However it is obvious that the mid/long term goal should be to replace
it completely with ICU.
Cheers,
--
Pierre
@pierrejoye | http://blog.thepimp.net | http://www.libgd.org
--- End Message ---
--- Begin Message ---
hi,
On Mon, Feb 1, 2010 at 9:25 PM, Tex Texin <[email protected]> wrote:
> mbstring can call out to icu to do the work.
Right, please read the thread; that's what Moriyoshi is working on. But
can we do it in a minor release? I don't think so.
Cheers,
--
Pierre
@pierrejoye | http://blog.thepimp.net | http://www.libgd.org
--- End Message ---
--- Begin Message ---
Hi!
> Right, pls read the thread, that's Moriyoshi is working on. But can we
> do it in a minor release? I don't think so.
If it returns the same results for existing encodings (which should be
the case, ideally, since the encodings are defined, but you know...) then
why not? The external API would stay the same, wouldn't it?
Though I don't know in detail what work is being done, so I'm speaking
theoretically.
--
Stanislav Malyshev, Zend Software Architect
[email protected] http://www.zend.com/
(408)253-8829 MSN: [email protected]
--- End Message ---
--- Begin Message ---
It can also be that you keep the existing conversions and call out to icu
for others...
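One way to picture that split, as a loose sketch in Python rather than
mbstring's C: a small registry of existing converters is consulted first,
and anything it does not know is handed to a general backend. Here
Python's own codec machinery merely stands in for ICU, and
NATIVE_CONVERTERS and convert() are hypothetical names.

```python
# Sketch: keep the existing converters, delegate unknown encodings to a backend.
# Python's codecs module stands in for ICU; all names here are hypothetical.
import codecs
from typing import Callable, Dict

# "Existing" converters: encoding name -> function from bytes to text.
NATIVE_CONVERTERS: Dict[str, Callable[[bytes], str]] = {
    "ascii": lambda data: data.decode("ascii"),
    "utf-8": lambda data: data.decode("utf-8"),
}


def convert(data: bytes, source: str, target: str = "utf-8") -> bytes:
    """Convert data from source to target, preferring native converters."""
    decode = NATIVE_CONVERTERS.get(source.lower())
    if decode is None:
        # Fallback path: the general-purpose backend handles everything else.
        codecs.lookup(source)  # raises LookupError for unknown encodings

        def decode(raw: bytes) -> str:
            return raw.decode(source)

    return decode(data).encode(target)


if __name__ == "__main__":
    gb = "\u4e2d\u6587".encode("gb18030")
    print(convert(gb, "GB18030"))      # served by the fallback backend
    print(convert(b"plain", "ascii"))  # served by a native converter
```

The external function stays the same either way; only the lookup order
decides which implementation does the work.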
-----Original Message-----
From: Stanislav Malyshev [mailto:[email protected]]
Sent: Monday, February 01, 2010 1:55 PM
To: [email protected]
Subject: Re: [PHP-I18N] Re: [PHP-DEV] RE: [PHP-I18N] adding GB18030 support
for mbstring
Hi!
> Right, pls read the thread, that's Moriyoshi is working on. But can we
> do it in a minor release? I don't think so.
If it returns same results for existing encodings (which should be the
case, ideally, since the encodings are defined, but you know...) then
why not? The external API would stay the same, not?
Though I don't know in detail which work is being done, so I'm speaking
theoretically.
--
Stanislav Malyshev, Zend Software Architect
[email protected] http://www.zend.com/
(408)253-8829 MSN: [email protected]
--- End Message ---
--- Begin Message ---
It looks like no one knows that mbstring uses a conversion library
called libmbfl, which is distributed as a separate package. It would be
possible for us to unbundle it and let users install it themselves.
ICU-backed mbstring covers most of the functions except for the
MIME-related ones, including mb_send_mail(). The exclusion is rather
intentional, as it has major problems in its design.
Moriyoshi
On Tue, Feb 2, 2010 at 9:00 AM, Tex Texin <[email protected]> wrote:
> It can also be that you keep the existing conversions and call out to icu
> for others...
--- End Message ---