php-i18n Digest 2 Apr 2010 18:45:29 -0000 Issue 439
Topics (messages 1376 through 1392):
Re: ctype_print returns false for British Pound symbol (and non-ASCII symbols)
1376 by: Rasmus Lerdorf
1377 by: Norbert Lindenberg
1378 by: Bob
1379 by: Rasmus Lerdorf
1380 by: Bob
1381 by: Bob
1382 by: Rasmus Lerdorf
1383 by: Bob
1384 by: Rasmus Lerdorf
1385 by: Norbert Lindenberg
1386 by: Jerry Schwartz
1387 by: Bob
1388 by: Bob
1389 by: Stanislav Malyshev
1390 by: Rasmus Lerdorf
1391 by: Bob
SimpleXMLElement occasionally fails to parse gb2312 or big5 RSS feeds
1392 by: Peter Pei
Administrivia:
To subscribe to the digest, e-mail:
[email protected]
To unsubscribe from the digest, e-mail:
[email protected]
To post to the list, e-mail:
[email protected]
----------------------------------------------------------------------
--- Begin Message ---
I doubt this has anything to do with PHP. The ctype functions are just
direct wrappers for your native ctype calls. Try this:
create a file called a.c:
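#include <stdio.h>
#include <ctype.h>
int main(int argc, char **argv) {
    if (argc < 2) return 1; /* need one argument to test */
    /* isprint() only ever sees the first byte of the argument */
    printf("%d\n", isprint((unsigned char)*argv[1]));
    return 0;
}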
Compile it with: make a
Then try:
10:21am new:~> ./a £
0
10:21am new:~> ./a $
16384
Same result. My LOCALE doesn't think £ is printable, but $ is. Switch
to dollars or fix your LOCALE.
-Rasmus
Bob wrote:
> [I did post this to php.general, but I think php.i18n may be more
> suitable.]
>
> In summary: ctype_print returns false for a string containing the British
> Pound symbol, and I'm sure that's not how it should behave.
>
> So far as I can tell, the British Pound symbol, '£' is considered a
> printable character according to the locale I use on my Ubuntu box. But
> even across two years, two boxes, several versions of Ubuntu (from 7.04
> to 9.10, one x86, one AMD64), and two major versions of PHP (PHP 4 and
> now PHP 5.2.11), I cannot get ctype_print to return true when a string
> given to it contains the British Pound symbol. (Or other non-ASCII
> characters such as ø or ß.)
>
> The locale I'm using is en_GB.UTF-8 and when I call setlocale(LC_ALL,
> 'en_GB.UTF-8') in PHP, it returns the name of this locale rather than
> FALSE, so that seems to be in order. (However, to be sure I have
> installed and reinstalled the language pack in Ubuntu as suggested by
> others.)
>
> I've even read through the en_GB and i18n locale definition files to
> confirm that <U00A3> (for the British Pound symbol) does appear within
> the print and graph sections, so both ctype_print and ctype_graph should
> consider it acceptable.
>
> What's most maddening is that ctype_print does return true on my shared
> hosting server, so I know that it can be achieved. I'm just hoping that
> someone here can tell me what I'm doing wrong, or what my operating
> system is doing wrong.
>
> For your information, I'm currently running the following:
>
> Ubuntu 9.10 (AMD64)
> Apache 2.2.14
> PHP 5.2.11 running as a CGI (to mirror the config of my shared host)
> Locale in use: en_GB.UTF-8
> LANG=en_GB.UTF-8
>
> Can anyone tell me how to get ctype_print to behave?
>
--- End Message ---
--- Begin Message ---
In which character encoding is your '£' represented? Remember that PHP
is ignorant about character encodings, a string is just a sequence of
bytes, and it's up to the application developer to make all components
agree on the character encoding used. If your '£' happens to be
encoded in ISO 8859-1, then its byte representation is the same as
"\xA3", which is not a valid UTF-8 string.
Norbert
On Feb 26, 2010, at 08:21 , Bob wrote:
> [I did post this to php.general, but I think php.i18n may be more
> suitable.]
>
> In summary: ctype_print returns false for a string containing the British
> Pound symbol, and I'm sure that's not how it should behave.
>
> So far as I can tell, the British Pound symbol, '£' is considered a
> printable character according to the locale I use on my Ubuntu box. But
> even across two years, two boxes, several versions of Ubuntu (from 7.04
> to 9.10, one x86, one AMD64), and two major versions of PHP (PHP 4 and
> now PHP 5.2.11), I cannot get ctype_print to return true when a string
> given to it contains the British Pound symbol. (Or other non-ASCII
> characters such as ø or ß.)
>
> The locale I'm using is en_GB.UTF-8 and when I call setlocale(LC_ALL,
> 'en_GB.UTF-8') in PHP, it returns the name of this locale rather than
> FALSE, so that seems to be in order. (However, to be sure I have
> installed and reinstalled the language pack in Ubuntu as suggested by
> others.)
>
> I've even read through the en_GB and i18n locale definition files to
> confirm that <U00A3> (for the British Pound symbol) does appear within
> the print and graph sections, so both ctype_print and ctype_graph should
> consider it acceptable.
>
> What's most maddening is that ctype_print does return true on my shared
> hosting server, so I know that it can be achieved. I'm just hoping that
> someone here can tell me what I'm doing wrong, or what my operating
> system is doing wrong.
>
> For your information, I'm currently running the following:
>
> Ubuntu 9.10 (AMD64)
> Apache 2.2.14
> PHP 5.2.11 running as a CGI (to mirror the config of my shared host)
> Locale in use: en_GB.UTF-8
> LANG=en_GB.UTF-8
>
> Can anyone tell me how to get ctype_print to behave?
--
PHP Unicode & I18N Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
--- End Message ---
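To make the encoding point above concrete: in ISO 8859-1 the pound sign is the single byte 0xA3, whereas UTF-8 encodes U+00A3 as the two bytes 0xC2 0xA3. A lone 0xA3 is a continuation byte with nothing to continue, so it is not well-formed UTF-8. What follows is only a minimal C sketch of a well-formedness check; the name is_valid_utf8 is made up for illustration and is not part of PHP or of any library mentioned in this thread.

```c
#include <stddef.h>

/* Returns 1 if the len bytes at s are well-formed UTF-8, otherwise 0.
 * Hypothetical helper for illustration only. */
int is_valid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char b = s[i];
        size_t extra;       /* continuation bytes expected after b */
        unsigned long cp;   /* code point being decoded */
        unsigned long min;  /* smallest code point legal at this length */

        if (b < 0x80)                { extra = 0; cp = b;        min = 0; }
        else if ((b & 0xE0) == 0xC0) { extra = 1; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { extra = 2; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { extra = 3; cp = b & 0x07; min = 0x10000; }
        else return 0;      /* stray continuation byte or invalid lead byte */

        if (extra > len - i - 1)
            return 0;       /* sequence is cut off */
        for (size_t k = 1; k <= extra; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;   /* not a continuation byte */
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        /* Reject overlong forms, surrogates and values past U+10FFFF. */
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;
        i += extra + 1;
    }
    return 1;
}
```

Called on the single byte 0xA3 this returns 0; called on the two bytes 0xC2 0xA3 it returns 1.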
--- Begin Message ---
Hello, Rasmus.
Thank you for the excellent advice. I was trying to work out how to call
the C-native version of ctype_print, and you managed to explain how to do
so in very few bytes of text. (And led me to compile my first C program
in Linux. Haven't used C for about fifteen years.)
I get the exact same output as yourself:
Ubuntu:~/ctype_print$ ./a £
0
Ubuntu:~/ctype_print$ ./a $
16384
So you're right. It's nothing to do with PHP.
Which means that my question is now: how do I fix my locale? The £ is
definitely in the locale definition file (under "print") for i18n, which
is copied into the LC_CTYPE section by en_GB. So am I right in thinking
that it should be a valid printable character when using that locale?
--- End Message ---
--- Begin Message ---
Bob wrote:
> Hello, Rasmus.
>
> Thank you for the excellent advice. I was trying to work out how to call
> the C-native version of ctype_print, and you managed to explain how to do
> so in very few bytes of text. (And led me to compile my first C program
> in Linux. Haven't used C for about fifteen years.)
>
> I get the exact same output as yourself:
>
> Ubuntu:~/ctype_print$ ./a £
> 0
> Ubuntu:~/ctype_print$ ./a $
> 16384
>
> So you're right. It's nothing to do with PHP.
>
> Which means that my question is now: how do I fix my locale? The £ is
> definitely in the locale definition file (under "print") for i18n, which
> is copied into the LC_CTYPE section by en_GB. So am I right in thinking
> that it should be a valid printable character when using that locale?
Like Norbert asked, which charset are you working in and does it match
your LOCALE?
-Rasmus
--- End Message ---
--- Begin Message ---
Hello, Norbert.
I'm using Netbeans IDE and it's in UTF-8 mode, as is my en_GB.UTF-8
locale.
Just to be sure, though, I also tried this:
$string = "\xc2\xa3"; // UTF-8 byte encoding for the British Pound sign
$this->assertTrue(ctype_print($string));
I believe that \xc2\xa3 is the UTF-8 byte encoding for the £ symbol, but
correct me if I'm wrong.
--- End Message ---
--- Begin Message ---
In short, yes I believe everything I'm using is set to be using UTF-8 to
match my locale, but see my response to Norbert for the longer answer.
--- End Message ---
--- Begin Message ---
Bob wrote:
> Hello, Norbert.
>
> I'm using Netbeans IDE and it's in UTF-8 mode, as is my en_GB.UTF-8
> locale.
>
> Just to be sure, though, I also tried this:
>
> $string = "\xc2\xa3"; // UTF-8 byte encoding for the British Pound sign
> $this->assertTrue(ctype_print($string));
>
> I believe that \xc2\xa3 is the UTF-8 byte encoding for the £ symbol, but
> correct me if I'm wrong.
ctype functions do not support multibyte encodings.
-Rasmus
--- End Message ---
--- Begin Message ---
So they don't work with UTF-8?
--- End Message ---
--- Begin Message ---
Bob wrote:
> So they don't work with UTF-8?
They'll work with the single-byte UTF-8 chars, but not the multi-byte ones.
-Rasmus
--- End Message ---
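A rough C sketch of what a byte-at-a-time check sees, assuming (as stated earlier in the thread) that the PHP functions simply hand each byte to the C library's isprint(). The helper name count_nonprintable_bytes is hypothetical. Note that isprint() consults the current LC_CTYPE locale, which in a C program stays "C" until setlocale() is called.

```c
#include <ctype.h>
#include <stddef.h>

/* Counts the bytes of s that isprint() rejects in the current locale.
 * Every byte is tested on its own, so a multi-byte UTF-8 sequence is
 * never seen as a single character. Hypothetical helper. */
size_t count_nonprintable_bytes(const char *s)
{
    size_t rejected = 0;
    for (; *s != '\0'; s++) {
        /* Cast first: passing a negative char to isprint() is undefined. */
        if (!isprint((unsigned char)*s))
            rejected++;
    }
    return rejected;
}
```

In the default "C" locale on glibc, "$" gives 0 rejected bytes, while the UTF-8 pound sign "\xC2\xA3" gives 2: neither 0xC2 nor 0xA3 is a printable character by itself.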
--- Begin Message ---
An alternative to ctype might be the PCRE extension. It can be set to
UTF-8 mode by using the /u pattern modifier, and knows about the
Unicode character classes. See
http://www.php.net/manual/en/reference.pcre.pattern.modifiers.php
http://www.php.net/manual/en/regexp.reference.unicode.php
Norbert
On Feb 26, 2010, at 14:03 , Rasmus Lerdorf wrote:
> Bob wrote:
>> So they don't work with UTF-8?
> They'll work with the single-byte UTF-8 chars, but not the multi-byte ones.
> -Rasmus
--- End Message ---
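At the C level, a comparable character-aware route is to decode the bytes with mbrtowc() and classify each decoded character with iswprint(), both of which are standard C. The sketch below assumes that a suitable UTF-8 locale (for example en_GB.UTF-8) is installed on the system; printable_in_locale is a hypothetical name, and this is not how PHP's ctype extension works.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Returns 1 if every character of s is printable under the named locale,
 * 0 if some character is not (or s is invalid in that locale's encoding),
 * and -1 if the locale cannot be selected. Hypothetical helper. */
int printable_in_locale(const char *locale_name, const char *s)
{
    /* Copy the current LC_CTYPE name so it can be restored afterwards. */
    const char *current = setlocale(LC_CTYPE, NULL);
    char *saved = NULL;
    if (current != NULL) {
        saved = malloc(strlen(current) + 1);
        if (saved != NULL)
            strcpy(saved, current);
    }

    if (setlocale(LC_CTYPE, locale_name) == NULL) {
        free(saved);
        return -1;
    }

    mbstate_t state;
    memset(&state, 0, sizeof state);
    size_t remaining = strlen(s);
    int result = 1;

    while (remaining > 0) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, s, remaining, &state);
        /* (size_t)-1: invalid sequence; (size_t)-2: incomplete sequence. */
        if (n == (size_t)-1 || n == (size_t)-2 || n == 0 ||
            !iswprint((wint_t)wc)) {
            result = 0;
            break;
        }
        s += n;
        remaining -= n;
    }

    if (saved != NULL) {
        setlocale(LC_CTYPE, saved);
        free(saved);
    }
    return result;
}
```

The design difference from isprint() is that classification happens per decoded character rather than per byte, so the outcome depends on the locale's encoding matching the bytes passed in.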
--- Begin Message ---
Also, for what it's worth, Microsoft uses a slightly different encoding in
CP-1252. I run into this all the time when people copy/paste from Word to ...
Regards,
Jerry Schwartz
The Infoshop by Global Information Incorporated
195 Farmington Ave.
Farmington, CT 06032
860.674.8796 / FAX: 860.674.8341
www.the-infoshop.com
>-----Original Message-----
>From: Norbert Lindenberg [mailto:[email protected]]
>Sent: Friday, February 26, 2010 1:54 PM
>To: [email protected]
>Cc: Norbert Lindenberg
>Subject: Re: [PHP-I18N] ctype_print returns false for British Pound symbol
>(and non-ASCII symbols)
>
>In which character encoding is your '£' represented? Remember that PHP
>is ignorant about character encodings, a string is just a sequence of
>bytes, and it's up to the application developer to make all components
>agree on the character encoding used. If your '£' happens to be
>encoded in ISO 8859-1, then its byte representation is the same as
>"\xA3", which is not a valid UTF-8 string.
>
>Norbert
--- End Message ---
--- Begin Message ---
You're joking?
So the ctype functions are barely of any use for characters beyond the
ASCII range?
Is that by design, or due to technical limitations? Either way, it should
be clearly stated in the PHP documentation.
And why does ctype_print return true for the British Pound symbol for
some people (including my hosting company's server) but false for others?
(Someone on php.general confirmed this strange disparity.)
--- End Message ---
--- Begin Message ---
Well, in the best case Ubuntu has a locale lookup problem, and in the
worst case the ctype functions are all but useless for real-world usage.
(Unless text that contains British currency is considered highly exotic.)
I'm very familiar with PCRE, so I guess I'm going to have to finally give
up on ctype and put together an analogue using preg_match. Just a pity
that the built-in ctype family seems so problematic.
--- End Message ---
--- Begin Message ---
Hi!
> So the ctype functions are barely of any use for characters beyond the
> ASCII range?
>
> Is that by design, or due to technical limitations? Either way, it should
> be clearly stated in the PHP documentation.
PHP is not a Unicode language yet. If you think it's a problem you're
welcome to port the ext/unicode stuff from the PHP 6 branch. ext/intl does
a lot of string stuff (collations, etc.) but not character stuff.
> And why does ctype_print return true for the British Pound symbol for
> some people (including my hosting company's server) but false for others?
> (Someone on php.general confirmed this strange disparity.)
Probably different encodings or different locale databases.
--
Stanislav Malyshev, Zend Software Architect
[email protected] http://www.zend.com/
(408)253-8829 MSN: [email protected]
--- End Message ---
--- Begin Message ---
Bob wrote:
> You're joking?
>
> So the ctype functions are barely of any use for characters beyond the
> ASCII range?
>
> Is that by design, or due to technical limitations? Either way, it should
> be clearly stated in the PHP documentation.
>
> And why does ctype_print return true for the British Pound symbol for
> some people (including my hosting company's server) but false for others?
> (Someone on php.general confirmed this strange disparity.)
Like I said, the PHP ctype functions are just thin wrappers over the
underlying system's ctype functions. Like many other things in PHP, we
are just a thin shell on top of basic system capabilities. Whatever
restrictions apply to the underlying system will apply to the PHP functions.
And, it works for some people because those people passed in the
single-byte ISO-8859 pound character whereas for the non-working version
you are passing in the 2-byte UTF-8 character.
-Rasmus
--- End Message ---
--- Begin Message ---
On Fri, 26 Feb 2010 16:18:40 -0800, Rasmus Lerdorf wrote:
> Like I said, the PHP ctype functions are just thin wrappers over the
> underlying system's ctype functions. Like many other things in PHP, we
> are just a thin shell on top of basic system capabilities. Whatever
> restrictions apply to the underlying system will apply to the PHP
> functions.
>
> And, it works for some people because those people passed in the
> single-byte ISO-8859 pound character whereas for the non-working version
> you are passing in the 2-byte UTF-8 character.
Very disappointing. But thank you all for helping to clear this up.
I'll knock together a compromise regex that makes sure no control
characters are present, and then do a project-wide find and replace.
--- End Message ---
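The "no control characters" compromise can also be sketched without a regex. The C function below is only an illustration under two assumptions: the input is already known to be well-formed UTF-8, and "control character" means the Unicode Cc category (U+0000 to U+001F, U+007F, and U+0080 to U+009F). The name utf8_has_control_chars is hypothetical.

```c
#include <stddef.h>

/* Returns 1 if the well-formed UTF-8 buffer s (len bytes) contains a
 * control character, otherwise 0. Hypothetical helper. */
int utf8_has_control_chars(const unsigned char *s, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        /* C0 controls (including tab and newline) and DEL are single bytes. */
        if (s[i] < 0x20 || s[i] == 0x7F)
            return 1;
        /* C1 controls U+0080..U+009F are encoded as 0xC2 0x80..0x9F.
         * In well-formed UTF-8, 0xC2 can only appear as a lead byte. */
        if (s[i] == 0xC2 && i + 1 < len &&
            s[i + 1] >= 0x80 && s[i + 1] <= 0x9F)
            return 1;
    }
    return 0;
}
```

With this check the UTF-8 pound sign (0xC2 0xA3) passes, because 0xA3 lies above the C1 range, while a string containing a newline or U+0085 is rejected. Whether tab and newline should be allowed is a policy choice the sketch does not make for you.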
--- Begin Message ---
I use the following code to fetch an RSS feed and parse it, but it
occasionally fails to parse gb2312- or big5-encoded feeds. At other
times it works fine. Any thoughts? Maybe SimpleXMLElement is simply not
meant for other language encodings...
$page = file_get_contents($rss);
try {
$feed = new SimpleXMLElement($page);
--- End Message ---