php-i18n Digest 27 Nov 2004 12:39:03 -0000 Issue 263

Topics (messages 808 through 812):

Re: Accented characters
        808 by: steve
        809 by: Tex Texin
        810 by: steve
        811 by: Tex Texin
        812 by: Jacob Singh


----------------------------------------------------------------------
--- Begin Message ---
David Powers wrote:

> Try this right at the top of the page before any headers are sent, and
> remove the charset metatag from the document <head>:
> 
> <?php header('Content-Type: text/html; charset=utf-8'); ?>

That just has the same effect as setting the meta tag to utf-8 - ie, it
works on the local system, doesn't on the live one, even though - as far as
I can work out - everything else is effectively equal.

Ah well - I guess I can always upload the new version of the site and, if
there's a problem, set utf-8 in the headers and have done with it.

However, are there likely to be issues with the MySQL server using latin1
(which I seem to have no choice about) and the pages using utf-8? The MySQL
tables do store info entered by users into web forms...
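
One way to live with a latin1 database behind utf-8 pages is to convert at the boundary, in both directions. The following is only a sketch: the function names are hypothetical, it assumes the mbstring extension and PHP 7 or later, and it assumes the bytes MySQL hands back for latin1 columns can be read as windows-1252.

```php
<?php
// Hypothetical helpers, not part of PHP. Assumes the mbstring extension.

// Database (latin1 / windows-1252 bytes) -> page (UTF-8).
function db_to_page(string $fromDb): string
{
    return mb_convert_encoding($fromDb, 'UTF-8', 'Windows-1252');
}

// Form input (UTF-8) -> database (windows-1252 bytes).
function page_to_db(string $fromForm): string
{
    return mb_convert_encoding($fromForm, 'Windows-1252', 'UTF-8');
}

// "\xE9" is e-acute as a single 1252 byte; it becomes C3 A9 in UTF-8.
echo bin2hex(db_to_page("\xE9")), "\n";
// And back again: C3 A9 becomes the single byte E9.
echo bin2hex(page_to_db("\xC3\xA9")), "\n";
```

The limitation is that anything a user types that has no windows-1252 equivalent cannot be stored faithfully this way, so this only suits sites whose text stays within Western European characters.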

-- 
@+
Steve

--- End Message ---
--- Begin Message ---
Steve,
The page looks to me to be a mix of encodings.
There is a euro character and an e-acute which are windows-1252.
(The euro is not in iso 8859-1, although many fonts will behave as if 8859-1
was 1252, so it looks ok.)

Then there are characters in front of the telephone numbers: "Ti??:"
I am not sure what was intended. I suspect somewhere along the way an
incorrect encoding conversion was performed. If it is coming from your
database, it may have been corrupted before it was stored. If you are lucky
the data in the database is ok, and the incorrect conversion is being
performed between the retrieval of the data and the inclusion in the web
page.

The way to debug this is to first recognize you have multiple components
(http, php, html, mysql, ...) containing or processing text, and text
processing always requires encoding to be taken into account.
So I would suggest embedding some non-ascii windows-1252 (iso 8859-1)
characters in your html template, php code, and mysql database. Then verify
how the characters appear when served. You may find that some are correct
and some are not. The ones that are not, will identify which components are
not in sync with the rest of the system.

Problems can have a number of causes: the encoding may be incorrectly
labeled, the wrong conversion can be performed, a conversion that should be
performed is not, or a conversion that should be performed is performed
twice.
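
As an illustration of the last case, here is a small sketch of what a conversion performed twice does to an e-acute. It assumes the mbstring extension is available; the variable names are only for illustration.

```php
<?php
// Assumes the mbstring extension.
$correct = "\xC3\xA9"; // e-acute, already valid UTF-8 (two bytes)

// Converting again "from windows-1252" treats each byte as its own character.
$twice = mb_convert_encoding($correct, 'UTF-8', 'Windows-1252');

echo bin2hex($correct), "\n"; // c3a9
echo bin2hex($twice), "\n";   // c383c2a9, which a UTF-8 browser shows as two characters
```

One accented letter turning into two odd characters on a page served as utf-8 is therefore a strong hint that a conversion was applied once too often.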

An important clue is to consider the characters you embedded and how they
changed when they were corrupted. For example, if you embed the euro (128)
and it turns into 3 bytes, it is likely a 1252 to utf-8 conversion was
performed. If the bytes are the same, but the display is different, then the
data is being decoded as if it were another encoding. (ie treating 1252 as
utf-8 or vice versa).
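
The byte counts mentioned above can be checked with a few lines of PHP. This is a minimal sketch, assuming the mbstring extension is installed; the bytes are written as escapes so the result does not depend on how the script file itself is saved.

```php
<?php
// Assumes the mbstring extension.

// The euro sign is the single byte 0x80 (decimal 128) in windows-1252.
$euro1252 = "\x80";

// Converting 1252 -> UTF-8 turns that one byte into three: E2 82 AC.
$euroUtf8 = mb_convert_encoding($euro1252, 'UTF-8', 'Windows-1252');
echo strlen($euro1252), ' -> ', strlen($euroUtf8), ' bytes: ', bin2hex($euroUtf8), "\n";

// e-acute is the byte 0xE9 in 1252 and the two bytes C3 A9 in UTF-8.
$eacuteUtf8 = mb_convert_encoding("\xE9", 'UTF-8', 'Windows-1252');
echo bin2hex($eacuteUtf8), "\n";
```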

Reviewing the characters in each component will help you narrow down where
the problem(s) reside.

Hth
tex


Tex Texin
Internationalization Architect,   Yahoo! Inc.
 
 

--- End Message ---
--- Begin Message ---
Tex Texin wrote:

> Steve,
> The page looks to me to be a mix of encodings.
> There is a euro character and an e-acute which are windows-1252.
> (The euro is not in iso 8859-1, although many fonts will behave as if
> 8859-1 was 1252, so it looks ok.)
> 
> Then there are characters in front of the telephone numbers: "Ti??:"
> I am not sure what was intended. I suspect somewhere along the way an
> incorrect encoding conversion was performed. If it is coming from your
> database, it may have been corrupted before it was stored. If you are
> lucky the data in the database is ok, and the incorrect conversion is
> being performed between the retrieval of the data and the inclusion in the
> web page.

Yeah, sorry Tex - when I included that URL I should have given some more
info. That's obviously the old version of the site, when I was even lower
down the learning curve(!). Some of those dodgy characters have arisen
because I simply typed the accented characters from the keyboard into the
PHP code, rather than using HTML entities like &eacute;. I started creating
the site on a Windows box and moved to a Linux one...

The bit that concerns me is this: on the page www.webvivant.com/market/, if
you enter the word 'partner' into the search box and hit 'search now...' it
should bring up just one message. At the end of that message is the
person's location, which is the Midi-Pyrénées. That text is pulled from the
MySQL table. The accented characters work fine on the live system when the
page is delivered/viewed as iso-8859-1. On my local system, the page needs
to be delivered/viewed as utf-8.

-- 
@+
Steve

--- End Message ---
--- Begin Message ---
The search returns 2 entries and I don't see the location on either.
In any event, if your mysql database can be accessed independent of the web
page and php, and is correctly encoded as 8859-1 or more correctly as 1252,
then somewhere between the db and the web server an encoding conversion is
being applied to change the data to utf-8.

Since php is in between, verify the encoding of the data that is retrieved
by php.
If it is still 1252, then verify that php is sending data to the server as
1252 (directly from php as well as data from the database).
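
To verify what php actually received, it helps to look at the raw bytes rather than at what the browser displays. A possible sketch follows; describe_bytes() is a hypothetical helper (not a PHP built-in), and it assumes the mbstring extension and PHP 7 or later.

```php
<?php
// Hypothetical helper: dump a string as hex and report whether it is valid UTF-8.
function describe_bytes(string $s): string
{
    $hex  = implode(' ', str_split(bin2hex($s), 2));
    $kind = mb_check_encoding($s, 'UTF-8') ? 'valid UTF-8' : 'not valid UTF-8';
    return $hex . ' (' . $kind . ')';
}

// The same word as 8859-1/1252 bytes and as UTF-8 bytes.
echo describe_bytes("Pyr\xE9n\xE9es"), "\n";
echo describe_bytes("Pyr\xC3\xA9n\xC3\xA9es"), "\n";
```

Calling it on a value straight after the database fetch, and again just before output, shows at which step the bytes change. Note that pure ASCII text is valid UTF-8 too, so the check only tells you something for strings containing accented characters.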

If you have any utilities that munge the data anywhere along the way, verify
those also.
hth

Tex Texin
Internationalization Architect,   Yahoo! Inc.
 
 


-----Original Message-----
From: steve [mailto:[EMAIL PROTECTED] 
Sent: Monday, November 22, 2004 2:40 AM
To: [EMAIL PROTECTED]
Subject: RE: [PHP-I18N] Accented characters


Tex Texin wrote:

> Steve,
> The page looks to me to be a mix of encodings.
> There is a euro character and an e-acute which are windows-1252. (The 
> euro is not in iso 8859-1, although many fonts will behave as if 
> 8859-1 was 1252, so it looks ok.)
> 
> Then there are characters in front of the telephone numbers: "Ti??:" I 
> am not sure what was intended. I suspect somewhere along the way an 
> incorrect encoding conversion was performed. If it is coming from your 
> database, it may have been corrupted before it was stored. If you are 
> lucky the data in the database is ok, and the incorrect conversion is 
> being performed between the retrieval of the data and the inclusion in 
> the web page.

Yeah, sorry Tex - when I included that URL I should have given some more
info. That's obviously the old version of the site, when I was even lower
down the learning curve(!). Some of those dodgy characters have arisen
because I simply typed the accented characters from the keyboard into the
PHP code, rather than using HTML entities like &eacute;. I started creating
the site on a Windows box and moved to a Linux one...

The bit that concerns me is this: on the page www.webvivant.com/market/, if
you enter the word 'partner' into the search box and hit 'search now...' it
should bring up just one message. At the end of that message is the person's
location, which is the Midi-Pyrénées. That text is pulled from the MySQL
table. The accented characters work fine on the live system when the page is
delivered/viewed as iso-8859-1. On my local system, the page needs to be
delivered/viewed as utf-8.

-- 
@+
Steve

-- 
PHP Internationalization Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php

--- End Message ---
--- Begin Message ---
Hi group!

My name is Jacob. I'm re-working a site (calabashmusic.com), a music download site specializing in music from around the world. We have tie-ups with numerous international partners as well. The current version is Latin-1 encoded and is only shown in English. The new one will be M17N and will need UTF because we are doing Japan.

Anyway, I had the same problem as Steve (I think; I've read the entire thread). It was a HASSLE. We got a new server after a crash, so I uploaded my DB dump from the local box onto a fresh MySQL 3.23 and Apache 2. It seemed more or less okay, but I didn't test extensively, and a week later, after 10,000 inserts had been made, I realized the accented chars were screwed up, and we use about 25 of them (28 to be exact). So I looked in the DB and, lo and behold, they were all corrupted and replaced with the char combos Steve mentioned. To be specific, it was an upper case A with two dots above it followed by another char, usually something weird like the Euro or a cubed exponent.

After struggling for a long time, I eventually wrote a PHP script to traverse the database, all tables and columns, and change these screwy chars back to their equivalents, which I had to match against similar records in my old DB. I also upgraded to MySQL 4.0.2 because I thought this might help.
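
For comparison, here is a sketch of a different repair technique from the record matching described above: reversing one round of a 1252-to-UTF-8 conversion. It only helps if the damage really is UTF-8 that was encoded a second time, which is a guess based on the symptoms. repair_double_utf8() is a hypothetical name, and the code assumes the mbstring extension and PHP 7 or later.

```php
<?php
// Hypothetical helper: undo one round of "windows-1252 -> UTF-8" if that is safe.
function repair_double_utf8(string $s): string
{
    // Map each character back to the single byte it came from.
    $candidate = mb_convert_encoding($s, 'Windows-1252', 'UTF-8');

    // Accept the result only if the step is lossless (converting forward again
    // reproduces the input) and the recovered bytes are valid UTF-8 themselves.
    $lossless = mb_convert_encoding($candidate, 'UTF-8', 'Windows-1252') === $s;
    if ($candidate !== $s && $lossless && mb_check_encoding($candidate, 'UTF-8')) {
        return $candidate;
    }
    return $s; // leave anything else untouched
}

$damaged = "\xC3\x83\xC2\xB3"; // o-acute after being encoded to UTF-8 twice
echo bin2hex(repair_double_utf8($damaged)), "\n"; // c3b3, o-acute in UTF-8
```

A string that was never damaged is returned unchanged, but a value that legitimately contains such a character pair cannot be told apart from a damaged one, so a script like this should be run on a copy of the data and the results spot-checked.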

Now everything seems to be pretty much okay, but I'm still shaking from the experience. I have no idea how it happened, or whether I'll repeat it. I have MySQL 4.0.2 with latin1, and Apache with ISO-8859-1 as the default charset.

Regarding what happened, any ideas? Regarding the future, how do I set up a purely Unicode environment, and how do I convert my old data to it?

Thanks
Jacob

Steve wrote:
I know, I know, you've had this a million times. But I have Googled on this
and not come up with anything that really matches my problem, so I need
some advice about refining my search. Here's the situation:

I have a site using shared hosting which is running Apache 1.3.27, PHP
4.1.2. As this is a site about France, accented characters are used a lot,
but have never been a problem. Some such characters are entered via HTML
forms on the site, others are in MySQL databases where they have been
entered on my local system via MySQLcc. The characters are all 'proper'
characters - ie, they are not stored as HTML entities.

And that's all working fine. But...

I'm revamping the site, and on my local system (Apache 2, PHP 4.3.4), where
I'm doing the development, the exact same databases, using the exact same
browser (Firefox, FWIW) have problems with accented chars, which are shown
as a jumble of 2-3 chars.

Any ideas why there might be this discrepancy? Could it have anything to do
with the way PHP is installed on the two systems?

And what is the general recommendation about storing accented characters in
text fields on MySQL DBs? Convert to htmlentities during the saving?
Problem with that is that I might need the same databases for generating
email mail-outs where I'm not using HTML...

This is a problem I thought I'd solved ages ago, so my head's in a bit of
spin. Any advice is most welcome.


--- End Message ---
