php-i18n Digest 19 Nov 2004 11:32:50 -0000 Issue 261

Topics (messages 792 through 800):

Re: Accented characters
        792 by: Christophe Chisogne
        793 by: steve
        794 by: David Herren
        795 by: David Herren
        796 by: Christophe Chisogne
        797 by: Tex Texin
        798 by: Tex Texin
        799 by: steve
        800 by: Tex Texin



----------------------------------------------------------------------
--- Begin Message ---
steve wrote:

> I'm revamping the site, and on my local system (Apache 2, PHP 4.3.4), where
> I'm doing the development, the exact same databases, using the exact same
> browser (Firefox, FWIW) have problems with accented chars, which are shown
> as a jumble of 2-3 chars.

Likely an encoding problem: latin1 (iso-8859-1) vs Unicode (utf-8).

Check whether the display changes when you switch encodings in the
character encoding menu in Firefox
(in the French UI: Affichage / Encodage des caractères).

You can use the Mozilla/Firefox LiveHTTPHeaders tool [8] to check
which encoding is used by the Apache server. It should be latin1/iso-8859-1,
not utf-8, utf-16, etc.
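
If you would rather script that check than read the headers by eye, here is
a minimal Python sketch (Python rather than PHP, and not part of
LiveHTTPHeaders). It assumes you have already copied the Content-Type value
from the response headers; the function name is made up for illustration.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the lower-cased charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


# Example header values, as a server might send them.
print(charset_from_content_type("text/html; charset=ISO-8859-1"))
print(charset_from_content_type("text/html; charset=utf-8"))
print(charset_from_content_type("text/html"))  # no charset declared at all
```

A result of None means the server declared no charset, so the browser falls
back to whatever the page itself (or the user) says.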

Avoid problems by

- telling the MySQL server to use the latin1 encoding

- declaring the encoding in the (generated?) HTML, like this
  <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

- setting Apache to use the latin1 encoding when none is specified [1-2],
  in httpd.conf
  AddDefaultCharset iso-8859-1 (default in fact)
  # not this, unless you know what you're doing: AddDefaultCharset utf-8

> And what is the general recommendation about storing accented characters in
> text fields on MySQL DBs? Convert to htmlentities during the saving?

If using MySQL, always use the latin1 (iso-8859-1) encoding. Don't put
HTML entities in the stored data; use them only when presenting the data.
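
To illustrate "entities only at presentation time", here is a small Python
sketch (the thread is about PHP, so treat it as an analogy only). The stored
value stays plain text; the conversion happens when the page is built. The
function name and sample string are hypothetical.

```python
import html


def to_html_ascii(text):
    """Escape markup characters, then turn every non-ASCII character
    into a numeric character reference. Meant for output only."""
    escaped = html.escape(text, quote=True)
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


stored = "Café <b>"           # what sits in the database: no entities
print(to_html_ascii(stored))  # Caf&#233; &lt;b&gt;
```

Because the database keeps the raw text, searching, sorting and length limits
work on real characters rather than on strings like "&#233;".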

You'll have to check for invalid chars submitted through your HTML forms
if your users use the infamous cp1252 (Windows-1252) encoding
(e.g. text pasted from Word, with Word's "smart" quotes).
latin1 doesn't define some chars that cp1252 does (e.g. "smart" quotes),
which causes display problems (wrong chars, '?' instead of the char,
or even no HTML rendered by the browser after the invalid char).

See my comment on php.net [3] about this, where you'll find
a translation table from the invalid cp1252 chars to HTML entities.
Just create a translation to ASCII/latin1 [4,5] that suits your taste.
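
A rough sketch of such a translation, written in Python rather than PHP and
not taken from the php.net comment [3]. The mapping below is a hypothetical
choice covering only a few common cp1252-only characters; extend it from the
table in [4] to suit your taste.

```python
# Hypothetical mapping: cp1252-only punctuation -> plain ASCII stand-ins.
CP1252_TO_ASCII = str.maketrans({
    "\u2018": "'",    # left single quote  (cp1252 byte 0x91)
    "\u2019": "'",    # right single quote (0x92)
    "\u201c": '"',    # left double quote  (0x93)
    "\u201d": '"',    # right double quote (0x94)
    "\u2013": "-",    # en dash            (0x96)
    "\u2014": "-",    # em dash            (0x97)
    "\u2026": "...",  # ellipsis           (0x85)
})


def cp1252_to_latin1(raw):
    """Decode form input as cp1252, swap the 'smart' characters for ASCII,
    and re-encode as latin1. Anything still unmappable becomes '?'."""
    text = raw.decode("cp1252", "replace")
    return text.translate(CP1252_TO_ASCII).encode("latin-1", "replace")


# Bytes as Word might paste them: smart quotes around "café", then an ellipsis.
print(cp1252_to_latin1(b"\x93caf\xe9\x94 \x85"))
```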

Some useful tools

- recode (OK, it's not PHP) to translate between encodings
- the Perl Encode module
- the source code (Perl) of the DecodeUTFKeys plugin of awstats [6],
  which can serve as inspiration for writing equivalent PHP code
- PHP multibyte strings (if you want utf-8, for example) [7]

For more information, Google is your friend.

Hope this helps,

Christophe

[1] Apache 1.3 AddDefaultCharset directive
http://httpd.apache.org/docs/mod/core.html#adddefaultcharset

[2] Apache 2 AddDefaultCharset directive
http://httpd.apache.org/docs-2.0/mod/core.html#adddefaultcharset

[3] my comment about latin1 / cp1252 (26-Feb-2004) on php.net
http://www.php.net/strtr

[4] cp1252 to Unicode table
ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT

[5] Latin 1 (1252) -- and .gif Graphic representation
http://www.microsoft.com/typography/unicode/1252.htm

[6] Src code for DecodeUTFKeys plugin of awstats
http://cvs.sourceforge.net/viewcvs.py/awstats/awstats/wwwroot/cgi-bin/plugins/decodeutfkeys.pm

[7] PHP Multibyte String Functions
http://www.php.net/manual/en/ref.mbstring.php

[8] Install LiveHTTPHeaders
http://livehttpheaders.mozdev.org/installation.html

--- End Message ---
--- Begin Message ---
Christophe Chisogne wrote:

> Likely an encoding problem : latin1 (iso-8859-1) vs Unicode (utf-8)
> 
> Check display diffs with display/encoding menu on firefox
> -- french Affichage/Encodage des caractères

Yeah, I checked that. The very same Firefox is working fine with the data on
the shared hosting, but not with my local system. To pin this down, I spent
some time getting the existing site (ie, the one currently working on the
shared hosting) working on my local system (wasn't easy, because I've made
loads of changes to tables for the new version of the site). The result?
Garbled characters. So - the exact same PHP code, MySQL tables and browser
work with the shared hosting but not the local system. I knew upgrading to
SuSE 9.1 was going to give me grief :-)

And that brings me to:

> You can use the mozilla/firefox livehttpheaders tool [8] to check
> which encoding is used by the Apache server. Should be latin1/iso-8859-1,
> not utf-8, utf-16 etc
> 
> Avoid problems by
> 
> - telling mysql server to use latin1 encoding

I'd come to the conclusion it must be a problem in my Apache/PHP/MySQL
setup, as that's the only difference between the two cases. So looks like
it's not a PHP issue after all - so sorry about the wasted bandwidth.

That said, MySQL is already configured to use latin1 and Apache2 seems to be
set to iso-8859-1 (according to /etc/apache2/mod_mime-defaults.conf - can't
find another config file with any encoding setting). Hmmm.

> - Better html code in (generated?) html, like this
>   <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

Already do that, though I tend to use iso-8859-15, which I understand is
pretty much the same with added euro support.

So, merci bien for a very useful response - I've printed it so I can go
through all the things you suggest. Much appreciated.

-- 
@+
Steve

--- End Message ---
--- Begin Message ---
I am clearly missing something. Why would you recommend iso-8859-1 instead
of the more universal utf-8? Virtually all of my work with php and mysql is
in foreign languages, and I have never had any problems once I set php,
mysql and all my web pages to always use utf-8.

On Nov 17, 2004, at 6:34 AM, Christophe Chisogne wrote:

> Should be latin1/iso-8859-1, not utf-8

/david

--
david herren - shoreham, vt us na terra solsys orionarm

Who would Jesus bomb?
--- End Message ---
--- Begin Message ---
I am admittedly no pundit (or I wouldn't need to read this list), but utf-8
is working for me just fine in SuSE 9.1 as well as in Windows XP and Mac OS X.
Data entry and retrieval on all three platforms is fault-free, but of course
you have to explicitly set utf-8 encoding in any web pages where users enter
data.

On Nov 17, 2004, at 8:41 AM, steve wrote:

> So - the exact same PHP code, MySQL tables and browser
> work with the shared hosting but not the local system. I knew upgrading to
> SuSE 9.1 was going to give me grief :-)

/david

--
david herren - shoreham, vt us na terra solsys orionarm

Who would Jesus bomb?
--- End Message ---
--- Begin Message ---
David Herren wrote:
> I am clearly missing something. Why would you recommend iso-8859-1
> instead of the more universal utf-8?

Two reasons.

1. A particular, not universal ;-) case: France. latin1 is largely enough.

   If western Europe/US is enough and you don't need Chinese chars etc.,
   then it's way easier not to fight with Unicode problems

   (you avoid browsers/spiders with no/poor Unicode support, font problems,
    transcoding problems and library problems, and you reduce storage size, etc.)

2. Technical

a) HTML and HTTP [4] default to the latin1 encoding; it's the de facto standard.
   OK, Apache can use utf-8, and HTML docs don't have to use the next line:
   <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
   Using another encoding requires (OK, a little) extra work.
   But latin1 support is complete and 'out of the box' (nearly) everywhere,
   while that's not (yet) true for Unicode (utf-8, utf-16, etc.)

b) MySQL has only recently started to support utf-8 (v4.1) [1],
   and many servers running MySQL don't support it yet.
   E.g. Debian: stable has 3.23.49, testing has 4.0.22

c) PHP: I know latin1 (8-bit) string handling is simple and transparent
   (even if *#!@ clients can send cp1252 via Word and cut & paste).
   To play with utf-8 encoded strings, you need to use
   special functions, the multibyte string functions [2].
   Not a big deal, but why mess with it if latin1 is enough?

d) Avoid problems by following the KISS principle [3] if you can.
   utf-8 support is not perfect on all platforms/languages/libs.
   E.g. the Perl Encode module requires at least Perl v5.7,
   but Debian stable has 5.6.1 (OK, I can use Encode::compat)
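
To make point c) concrete, here is a small sketch of why byte-oriented
string handling is transparent for latin1 but not for utf-8. It is written
in Python purely as an illustration (not PHP's own functions); the helper
name is hypothetical.

```python
def encoded_length(text, encoding):
    """Number of bytes the text occupies in the given encoding."""
    return len(text.encode(encoding))


word = "étude"
print(len(word))                        # 5 characters
print(encoded_length(word, "latin-1"))  # 5 bytes: one byte per character
print(encoded_length(word, "utf-8"))    # 6 bytes: the accented letter takes two

# Cutting utf-8 at a byte offset can split a character in two.
first_byte = word.encode("utf-8")[:1]
print(ascii(first_byte.decode("utf-8", "replace")))  # '\ufffd'
```

A byte-counting length or substring function gives the right answer for the
latin1 bytes and the wrong one for the utf-8 bytes, which is what the
multibyte functions [2] are there to fix.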

> Virtually all of my work with php and mysql is in foreign languages, and
> I have never had any problems once I set php, mysql and all my web
> pages to always use utf-8.

OK, I'll stop playing devil's advocate here :)

When you must deal with many foreign languages (outside western Europe and
the US), you don't really have a choice: Unicode is the only real option,
and utf-8 is the obvious choice of Unicode encoding (utf-16, for example,
is not). The first 128 Unicode code points are ASCII and the first 256 are
latin1, so compatibility problems are minimized.
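
That relationship can be checked in a few lines. This is only an
illustrative Python sketch with hypothetical function names; note that it is
the code points that line up with latin1, while at the byte level utf-8
agrees only with ASCII.

```python
def latin1_is_unicode_prefix():
    """True if latin1 byte N always decodes to Unicode code point N."""
    return bytes(range(256)).decode("latin-1") == "".join(map(chr, range(256)))


def same_bytes(text, enc_a, enc_b):
    """True if the text is encoded to identical bytes in both encodings."""
    return text.encode(enc_a) == text.encode(enc_b)


print(latin1_is_unicode_prefix())                 # True
print(same_bytes("etude", "ascii", "utf-8"))      # True: ASCII text is unchanged
print(same_bytes("étude", "latin-1", "utf-8"))    # False: bytes differ above 0x7F
```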

In conclusion, utf-8 vs latin1 is a matter of choice, depending on the
particular case and constraints. In my case (Belgium), latin1
is the obvious choice (western Europe/US is enough). Remember, KISS.

PS Whatever the choice, we have to deal with the other one.
   E.g. you choose utf-8 but the web client uses latin1: transcoding needed.
   E.g. you choose latin1 but a web server (say Google) uses utf-8: ditto.
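
Transcoding itself is short once you know both encodings. A minimal sketch
in Python (not the recode or Perl tools mentioned earlier in the thread);
the function name and the choice of the "replace" error policy are
assumptions for illustration.

```python
def transcode(data, source, target, errors="strict"):
    """Re-encode bytes from one charset to another.

    With errors="strict" an unmappable character raises UnicodeEncodeError;
    with errors="replace" it silently becomes '?'.
    """
    return data.decode(source).encode(target, errors)


latin1_bytes = b"\xe9tude"                           # "étude" as latin1
utf8_bytes = transcode(latin1_bytes, "latin-1", "utf-8")
print(utf8_bytes)                                    # b'\xc3\xa9tude'
print(transcode(utf8_bytes, "utf-8", "latin-1"))     # b'\xe9tude'

# Going from utf-8 down to latin1 can lose characters.
print(transcode("5 €".encode("utf-8"), "utf-8", "latin-1", errors="replace"))
```

The round trip is lossless only while every character exists in both
encodings, which is the practical cost of choosing latin1.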

PPS Lots of sites have encoding problems, with utf-8 rendered as latin1
    or the reverse. A simple example on dmoz.be [5]

PPPS Wow, you read this far! Congratulations :)

[1] MySQL Manual : 1.2.2 The Main Features of MySQL
http://dev.mysql.com/doc/mysql/en/Features.html

[2] Multibyte String Functions
http://www.php.net/manual/en/ref.mbstring.php

[3] KISS principle
http://en.wikipedia.org/wiki/KISS_Principle

[4] See 3.7.1 Canonicalization and Text Defaults in HTTP/1.1 spec
ftp://ftp.rfc-editor.org/in-notes/rfc2616.txt

[5] Search results for 'étude' on dmoz.be (Belgium, French)
http://search.dmoz.org/cgi-bin/search?search=%E9tude&all=no&cat=World%2FFran%E7ais%2FR%E9gional%2FEurope%2FBelgique

--- End Message ---
--- Begin Message ---
The world has changed and ISO 8859-1 is no longer adequate for Europe, or
even just France. You are using assumptions that used to be true and are no
longer.

1) ISO 8859-1 does not have the Euro character so is not really suitable for
France or Europe, unless you never have or discuss commercial transactions.

2) Also, with European enlargement, you can anticipate Eastern European
characters, which are not in latin-1, becoming more prevalent and a
requirement. Greek, which has been in the EU longer, is also not covered by
latin-1 but is less likely to be a requirement for business or other
applications outside of Greece.

3) HTML does not default to 8859-1 and specifically says a default should
not be assumed, despite HTTP's default.
http://www.w3.org/TR/html401/charset.html#h-5.2.2

4) The limitations of MySQL and PHP are as you say, but are not that much
work to get around. On the other hand, using escapes to represent the
characters missing from 8859-1 will make your source error-prone and
difficult to read. It can also get in the way of your users uploading data
to your MySQL database (if the UI generates "?" instead of escapes, or if
the escapes reduce the potential string length/field width of their
responses).

5) If your site is successful, you will have to either go through the work to
convert to utf-8 anyway, or suffer with multiple parallel systems using
different encodings on each.
"Doing it right" from the beginning is "keeping it simple".

...My €0.02
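
Points 1), 2) and 4) can be checked quickly. The following is an
illustrative Python sketch only (the helper name and sample strings are
made up); it tests which encodings can represent the euro sign and a Polish
place name, and shows the two fallbacks mentioned in point 4).

```python
def encodable(text, encoding):
    """True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


for enc in ("iso-8859-1", "iso-8859-15", "cp1252", "utf-8"):
    print(enc, encodable("€", enc), encodable("Łódź", enc))

# The two fallbacks when a character is missing from the target charset:
print("5 €".encode("iso-8859-1", "replace"))      # b'5 ?'
print("5 €".encode("ascii", "xmlcharrefreplace")) # b'5 &#8364;'
```

The escape form turns one character into seven, which is how a short form
field can overflow a fixed-width database column.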

Tex Texin
Internationalization Architect,   Yahoo! Inc.
Phone: +1 408 349 7403
 


-----Original Message-----
From: Christophe Chisogne [mailto:[EMAIL PROTECTED] 
Sent: Thursday, November 18, 2004 1:35 AM
To: php-i18n
Subject: Re: [PHP-I18N] Accented characters



-- 
PHP Internationalization Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php

--- End Message ---
--- Begin Message ---
So Steve, I agree with Christophe's diagnosis, and would suggest that you
are probably seeing multiple jumbled chars because the page contains
utf-8 and is being decoded as ISO 8859-1. Tell your browser to use utf-8. If
that fixes the display, then tell your server the page is utf-8 and/or set
the meta http-equiv statement in the <head> section of the HTML to declare
the charset as utf-8 instead of iso-8859-1.
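
The "jumble of 2-3 chars" is easy to reproduce. A minimal Python sketch
(for illustration only; the function name is hypothetical) that encodes
text one way and decodes it the other:

```python
def misread(text, actual, assumed):
    """Encode text with its real charset, then decode it with the wrong one."""
    return text.encode(actual).decode(assumed, "replace")


# utf-8 bytes shown as latin1: each accented letter becomes two characters.
print(misread("é", "utf-8", "latin-1"))             # Ã©
# The euro sign is three bytes in utf-8, so it becomes three characters.
print(len(misread("€", "utf-8", "latin-1")))        # 3

# The reverse mistake: latin1 bytes are not valid utf-8, so the decoder
# substitutes the replacement character U+FFFD.
print(ascii(misread("étude", "latin-1", "utf-8")))  # '\ufffdtude'
```

Two or three junk characters per accented letter points to utf-8 data read
as latin1; a single "?" or box per letter points to the reverse.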

hth

Tex Texin
Internationalization Architect,   Yahoo! Inc.
 


-----Original Message-----
From: Christophe Chisogne [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, November 17, 2004 3:34 AM
To: php-i18n
Subject: Re: [PHP-I18N] Accented characters



--- End Message ---
--- Begin Message ---
Tex Texin wrote:

> So Steve, I agree with Christophe's diagnosis, and would suggest that
> probably you are seeing multiple jumbled chars, because the page contains
> utf-8 and is being decoded as iso 8859-1. Tell your browser to use utf-8.
> If that fixes the display, then tell your server the page is utf-8 and/or
> set the meta http-equiv statement in the <head> section of the html that
> the charset is utf-8, instead of iso8859-1.

I think this might be it. Why these chars are utf-8 is beyond me (hell, I'm
no programmer - I make my living writing & shooting pictures). Why these
pages appear correct when delivered by my shared hosting (these pages are
iso-8859-1) but not when served by my local system is baffling to me. Is
this to do with Apache settings?

I've now changed a page to UTF-8, as in:

<meta http-equiv="content-type" content="text/html;charset=UTF-8">

But my browsers are insisting on viewing as iso-8859-1 unless I manually
select utf-8. I have Firefox set to auto-detect->universal and have even
set the default to UTF-8 under preferences. I tried with IE from my Windows
box - same deal. Uh ... time for bed.

-- 
@+
Steve

--- End Message ---
--- Begin Message ---
The precedence rule that browsers are supposed to follow is that if the
server sets the charset in the HTTP (transmission) protocol, then that
should be used. If it is not set in HTTP, then the meta HTML statement is
followed. So if, after changing the meta statement, the browser still
insists on defaulting to 8859-1, then your server is probably sending the
page with HTTP declaring ISO 8859-1.

Depending on your Apache version, there was a bug in Apache a short while
back, where it set the HTTP charset to ISO 8859-1 when it shouldn't set any
charset.

You might try saving the page locally and then opening it with a browser.
Opening it as a local file will eliminate http as a variable.
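
The precedence rule described above can be sketched in a few lines of
Python. This is a simplified illustration, not how any browser is actually
implemented: the function names are hypothetical, the meta lookup is a crude
regular expression over the first bytes of the page, and real browsers have
further fallbacks (auto-detection, user preferences).

```python
import re
from email.message import Message

# Crude pattern for charset=... inside a <meta> tag (assumption: good enough
# for simple pages, not a real HTML parser).
META_CHARSET = re.compile(rb"<meta[^>]+charset\s*=\s*[\"']?\s*([\w.:-]+)", re.IGNORECASE)


def http_charset(content_type):
    """Charset from the HTTP Content-Type header value, or None."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def meta_charset(html_bytes):
    """Charset from a <meta http-equiv> declaration near the top, or None."""
    match = META_CHARSET.search(html_bytes[:1024])
    return match.group(1).decode("ascii").lower() if match else None


def effective_charset(content_type, html_bytes):
    """HTTP header wins; the meta declaration is used only if HTTP is silent."""
    return http_charset(content_type) or meta_charset(html_bytes)


page = b'<head><meta http-equiv="content-type" content="text/html;charset=UTF-8"></head>'
print(effective_charset("text/html; charset=ISO-8859-1", page))  # iso-8859-1
print(effective_charset("text/html", page))                      # utf-8
print(effective_charset(None, page))                             # utf-8 (local file)
```

The first call matches the symptom in this thread: the page says UTF-8, but
a charset in the HTTP header overrides it. The last call corresponds to
opening the saved page as a local file, where there is no HTTP header at all.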

hth

Tex Texin
Internationalization Architect,   Yahoo! Inc.
 


-----Original Message-----
From: steve [mailto:[EMAIL PROTECTED] 
Sent: Thursday, November 18, 2004 3:15 PM
To: [EMAIL PROTECTED]
Subject: RE: [PHP-I18N] Accented characters



--- End Message ---
