php-i18n Digest 26 Feb 2010 16:21:28 -0000 Issue 438
Topics (messages 1370 through 1375):
Re: adding GB18030 support for mbstring
1370 by: KITAZAKI Shigeru
1371 by: Tex Texin
1372 by: Tex Texin
1373 by: Moriyoshi Koizumi
1374 by: Tex Texin
ctype_print returns false for British Pound symbol (and non-ASCII symbols)
1375 by: Bob
Administrivia:
To subscribe to the digest, e-mail:
[email protected]
To unsubscribe from the digest, e-mail:
[email protected]
To post to the list, e-mail:
[email protected]
----------------------------------------------------------------------
--- Begin Message ---
Moriyoshi Koizumi wrote:
> It just turned out ISO_2022,locale=ja,version=3 is actually ISO-2022-JP-MS.
Thank you for your valuable information.
I'll try ISO_2022,locale=ja,version=3 with ISO-2022-JP-MS.
Tex, thank you for your cooperation.
'Halfwidth', called 'hankaku' in Japanese, is sometimes a special case when
converting encodings. If you are interested in Japanese too, this article is
useful.
http://en.wikipedia.org/wiki/Katakana
Shigeru
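
A minimal sketch of why halfwidth katakana can be a special case, written in Python purely for illustration (it is not mbstring or ICU code). Halfwidth GA is stored as two code points, KA plus a separate voiced sound mark, while fullwidth GA is a single code point, so a converter has to decide whether and how to combine them. Unicode NFKC normalization is used here as one example of such folding; the variable names are made up for the example.

```python
import unicodedata

# Halfwidth (hankaku) GA: HALFWIDTH KATAKANA LETTER KA (U+FF76)
# followed by HALFWIDTH KATAKANA VOICED SOUND MARK (U+FF9E).
halfwidth_ga = "\uff76\uff9e"

# NFKC maps the pair to the single fullwidth KATAKANA LETTER GA (U+30AC).
fullwidth_ga = unicodedata.normalize("NFKC", halfwidth_ga)

print(len(halfwidth_ga), len(fullwidth_ga))          # 2 1
print([f"U+{ord(c):04X}" for c in fullwidth_ga])     # ['U+30AC']
```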
--- End Message ---
--- Begin Message ---
Thanks, I understand Japanese scripts and encodings.
-----Original Message-----
From: KITAZAKI Shigeru [mailto:[email protected]]
Sent: Wednesday, February 03, 2010 12:08 AM
To: Moriyoshi Koizumi; Tex Texin
Cc: [email protected]
Subject: Re: [PHP-I18N] adding GB18030 support for mbstring
Moriyoshi Koizumi wrote:
> It just turned out ISO_2022,locale=ja,version=3 is actually ISO-2022-JP-MS.
Thank you for your valuable information.
I'll try ISO_2022,locale=ja,version=3 with ISO-2022-JP-MS.
Tex, thank you for your cooperation.
'Halfwidth', called 'hankaku' in Japanese, is sometimes a special case when
converting encodings. If you are interested in Japanese too, this article is
useful.
http://en.wikipedia.org/wiki/Katakana
Shigeru
--- End Message ---
--- Begin Message ---
Yes, Microsoft's documentation is often loose with respect to encodings.
E.g. they claim 932 and Shift-JIS are the same when they aren't, etc.
I'll look for confirmation from Kitazaki-san that
ISO_2022,locale=ja,version=3 works.
tex
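
As a rough illustration of the 932 versus Shift-JIS point, here is a small Python sketch. It relies on Python's own `shift_jis` and `cp932` codecs as stand-ins, which is an assumption: they are not Microsoft's or mbstring's tables, but they show the same kind of divergence. The helper `can_encode` is a name invented for the example.

```python
def can_encode(text: str, codec: str) -> bool:
    """Return True if text can be encoded with the given codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

# The same two bytes decode to different code points under each codec.
raw = b"\x81\x60"
as_sjis = raw.decode("shift_jis")   # WAVE DASH, U+301C
as_cp932 = raw.decode("cp932")      # FULLWIDTH TILDE, U+FF5E
print(f"U+{ord(as_sjis):04X} vs U+{ord(as_cp932):04X}")

# CIRCLED DIGIT ONE (U+2460) is a vendor extension present only in cp932.
circled_one = "\u2460"
print(can_encode(circled_one, "cp932"), can_encode(circled_one, "shift_jis"))
```

Text that round-trips under one codec may therefore change or fail under the other, so treating the two names as interchangeable is risky.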
-----Original Message-----
From: Moriyoshi Koizumi [mailto:[email protected]]
Sent: Tuesday, February 02, 2010 11:32 PM
To: Tex Texin
Cc: KITAZAKI Shigeru; [email protected]
Subject: Re: [PHP-I18N] adding GB18030 support for mbstring
That is not correct. .NET Names here are also used internally in MS
products as well as codepages, and don't necessarily reflect the
actual codeset defined in the IANA charset even if the names look the same.
Look at "additional information" for the differences.
Moriyoshi
On Wed, Feb 3, 2010 at 4:16 PM, Tex Texin <[email protected]> wrote:
> Yes- 50220 is just normal ISO-2022-JP:
> http://msdn.microsoft.com/en-us/library/dd317756(VS.85).aspx
>
>
> -----Original Message-----
> From: Moriyoshi Koizumi [mailto:[email protected]]
> Sent: Tuesday, February 02, 2010 10:54 PM
> To: KITAZAKI Shigeru
> Cc: [email protected]
> Subject: Re: [PHP-I18N] adding GB18030 support for mbstring
>
> It just turned out ISO_2022,locale=ja,version=3 is actually ISO-2022-JP-MS.
>
> Moriyoshi
>
> On Wed, Feb 3, 2010 at 10:22 AM, Moriyoshi Koizumi <[email protected]> wrote:
>> None of them can handle CP50220.
>>
>> Moriyoshi
>>
>> 2010/2/3 Tex Texin <[email protected]>:
>>> icu has at least 5 versions of iso 2022-jp.
>>>
>>> http://demo.icu-project.org/icu-bin/convexp
>>>
>>> If the one you refer to is not one of these, send me the details and I'll log
>>> it with the icu team.
>>>
>>> tex
>>>
>>>
>>> -----Original Message-----
>>> From: KITAZAKI Shigeru [mailto:[email protected]]
>>> Sent: Tuesday, February 02, 2010 4:43 AM
>>> To: Moriyoshi Koizumi
>>> Cc: [email protected]
>>> Subject: Re: [PHP-I18N] adding GB18030 support for mbstring
>>>
>>> Koizumi-san
>>>
>>> Let me raise one concern about mbstring-ng.
>>> The current mbstring supports 'ISO-2022-JP-MS', which is different from
>>> 'ISO-2022-JP', and I guess the current implementation of ICU cannot
>>> convert between ISO-2022-JP-MS and Unicode correctly.
>>> For example, Japanese hankaku katakana such as GA, which carries a sonant mark.
>>>
>>> Although modifying ICU itself would be the better way, it would take a long time.
>>> What do you think about this?
>>>
>>> Moriyoshi Koizumi wrote:
>>>> BTW, I created an extension that is near-compatible with mbstring and
>>>> based on ICU that of course supports GB18030. See
>>>> http://github.com/moriyoshi/mbstring-ng for detail.
>>>>
>>>
>>> Regards,
>>> Shigeru
>>>
>>> --
>>> PHP Unicode & I18N Mailing List (http://www.php.net/)
>>> To unsubscribe, visit: http://www.php.net/unsub.php
--- End Message ---
--- Begin Message ---
Note that CP50220 isn't identical to ISO-2022-JP-MS despite the naming.
JIS X 0201 katakana are not allowed by CP50220.
Moriyoshi
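
A hedged Python sketch of how two ISO-2022-JP variants can differ on JIS X 0201 katakana. Python ships neither CP50220 nor ISO-2022-JP-MS, so its `iso2022_jp` and `iso2022_jp_ext` codecs are used as stand-ins (an assumption made only for illustration); the helper `can_encode` is hypothetical.

```python
def can_encode(text: str, codec: str) -> bool:
    """Return True if text can be encoded with the given codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

# HALFWIDTH KATAKANA LETTER KA (U+FF76), a JIS X 0201 katakana.
halfwidth_ka = "\uff76"

# Plain iso2022_jp has no JIS X 0201 katakana set, so encoding fails.
print(can_encode(halfwidth_ka, "iso2022_jp"))

# iso2022_jp_ext adds JIS X 0201 katakana, so the character round-trips.
encoded = halfwidth_ka.encode("iso2022_jp_ext")
round_tripped = encoded.decode("iso2022_jp_ext")
print(encoded, round_tripped == halfwidth_ka)
```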
On 2/3/10, Tex Texin <[email protected]> wrote:
> Yes, Microsoft's documentation is often loose with respect to encodings.
> E.g. they claim 932 and Shift-JIS are the same when they aren't, etc.
>
> I'll look for confirmation from Kitazaki-san that
> ISO_2022,locale=ja,version=3 works.
>
> tex
--- End Message ---
--- Begin Message ---
Yes, the doc says that. See CP50221, which adds the kana, and CP50222.
-----Original Message-----
From: Moriyoshi Koizumi [mailto:[email protected]]
Sent: Wednesday, February 03, 2010 1:47 AM
To: Tex Texin
Cc: KITAZAKI Shigeru; [email protected]
Subject: Re: [PHP-I18N] adding GB18030 support for mbstring
Note that CP50220 isn't identical to ISO-2022-JP-MS despite the naming.
JIS X 0201 katakana are not allowed by CP50220.
Moriyoshi
--- End Message ---
--- Begin Message ---
[I did post this to php.general, but I think php.i18n may be more
suitable.]
In summary: ctype_print returns false for a string containing the British
Pound symbol, and I'm sure that's not how it should behave.
So far as I can tell, the British Pound symbol, '£' is considered a
printable character according to the locale I use on my Ubuntu box. But
even across two years, two boxes, several versions of Ubuntu (from 7.04
to 9.10, one x86, one AMD64), and two major versions of PHP (PHP 4 and
now PHP 5.2.11), I cannot get ctype_print to return true when a string
given to it contains the British Pound symbol. (Or other non-ASCII
characters such as ø or ß.)
The locale I'm using is en_GB.UTF-8 and when I call setlocale(LC_ALL,
'en_GB.UTF-8') in PHP, it returns the name of this locale rather than
FALSE, so that seems to be in order. (However, to be sure I have
installed and reinstalled the language pack in Ubuntu as suggested by
others.)
I've even read through the en_GB and i18n locale definition files to
confirm that <U00A3> (for the British Pound symbol) does appear within
the print and graph sections, so both ctype_print and ctype_graph should
consider it acceptable.
What's most maddening is that ctype_print does return true on my shared
hosting server, so I know that it can be achieved. I'm just hoping that
someone here can tell me what I'm doing wrong, or what my operating
system is doing wrong.
For your information, I'm currently running the following:
Ubuntu 9.10 (AMD64)
Apache 2.2.14
PHP 5.2.11 running as a CGI (to mirror the config of my shared host)
Locale in use: en_GB.UTF-8
LANG=en_GB.UTF-8
Can anyone tell me how to get ctype_print to behave?
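
One possible explanation, offered as a guess rather than a diagnosis: PHP's ctype functions test a string one byte at a time, and in UTF-8 the pound sign is two bytes (0xC2 0xA3), neither of which is a complete character. The Python sketch below only illustrates that difference between byte-wise and character-wise checks. The function `bytewise_printable` is a hypothetical stand-in that accepts printable ASCII only; it is not PHP's actual implementation, whose result depends on the C library and locale.

```python
def bytewise_printable(data: bytes) -> bool:
    """Return True only if every byte is printable ASCII (0x20-0x7E)."""
    return len(data) > 0 and all(0x20 <= b <= 0x7E for b in data)

pound = "\u00a3"                      # POUND SIGN
pound_utf8 = pound.encode("utf-8")    # two bytes: 0xC2 0xA3

# Byte-wise check: both bytes fall outside the printable ASCII range.
print(len(pound_utf8), bytewise_printable(pound_utf8))

# Character-wise check: decode first, then classify whole characters.
print(pound_utf8.decode("utf-8").isprintable())

# Plain ASCII passes both checks.
print(bytewise_printable(b"GBP 5"), "GBP 5".isprintable())
```

If this is what is happening, a host whose locale or file encoding is single-byte (for example ISO-8859-1, where the pound sign is the single byte 0xA3) could plausibly give a different answer, though that is an assumption about the shared server.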
--- End Message ---