It seems that the UTF-8 support in Perl is still transitional. By that I mean there are situations where a string gets converted back and forth between UTF-8 and the local character set (Latin-1 in my case) several times as it passes through the system.

Here's a chain I've observed on one of my machines:

   DB -> daemon -> HTTP -> ASP -> Browser
 Latin-1  UTF-8      Latin-1       UTF-8

(View with a fixed-space font.)

DB is a special-purpose database we use; there's some Latin-1 encoded data in it.

daemon is a background process written in Perl that sits between the special database and the Apache::ASP code. When it pulls the data in from the database, Perl upconverts the data to UTF-8 on systems like Red Hat Linux 9 where the LANG variable is set to something like en_US.UTF-8.

The daemon uses HTTP::Daemon to interface with the ASP code. We do it this way for reasons that aren't germane to the discussion. What's important is that in the ASP code, the LANG variable is unset for whatever reason. Therefore, Perl seems to convert the UTF-8 encoded data back into Latin-1, probably within the HTTP parsing code. It's clear, at least, that it's in Latin-1 throughout the ASP processing.

Finally, Apache seems to convert the data back to UTF-8 before it is sent off to the browser. Presumably this is because modern browsers advertise UTF-8 support.
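To make the chain concrete, here is a minimal sketch of the same three conversions done by hand with the Encode module. It is only an illustration of what I believe is happening to the bytes, not the code that actually runs in our daemon or in Apache::ASP; the sample string and the hex_bytes helper are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: show a byte string as space-separated hex.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

# "cafe" with an e-acute, as raw Latin-1 bytes (stand-in for a DB value).
my $latin1 = "caf\xE9";

# Hop 1 (DB -> daemon): Latin-1 bytes become UTF-8 bytes.
my $utf8 = encode('UTF-8', decode('ISO-8859-1', $latin1));

# Hop 2 (HTTP -> ASP): UTF-8 bytes are turned back into Latin-1 bytes.
my $back = encode('ISO-8859-1', decode('UTF-8', $utf8));

# Hop 3 (ASP -> browser): Latin-1 bytes become UTF-8 bytes once more.
my $final = encode('UTF-8', decode('ISO-8859-1', $back));

print "latin1: ", hex_bytes($latin1), "\n";   # 63 61 66 E9
print "utf8:   ", hex_bytes($utf8),   "\n";   # 63 61 66 C3 A9
print "back:   ", hex_bytes($back),   "\n";
print "final:  ", hex_bytes($final),  "\n";
```

The second and third hops cancel each other out, which is why the middle conversion looks like wasted work.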

Right now, we're coping okay with these conversions. The only concern is that the conversion from UTF-8 back to Latin-1 is unnecessary. Some day, we might decide to go in and force things to keep the data in UTF-8 all the way through the chain after that first conversion, for efficiency. Does anyone know how we can force Perl to keep the data in UTF-8 format, even when the LANG variable isn't set?
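The direction I'd guess at, untested on our setup, is to stop relying on LANG altogether: decode the database bytes once, encode once on the way out, and declare the encoding layer on each filehandle explicitly. A rough sketch follows. The from_db and to_wire helpers are hypothetical names of my own, and I don't know yet whether this is enough to stop the conversion we see inside the HTTP parsing code.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: decode database bytes exactly once, at the boundary.
sub from_db {
    my ($bytes) = @_;
    return decode('ISO-8859-1', $bytes);   # now a Perl character string
}

# Hypothetical helper: encode exactly once, when bytes leave the process.
sub to_wire {
    my ($chars) = @_;
    return encode('UTF-8', $chars);        # UTF-8 bytes, e.g. an HTTP body
}

# Declare the output encoding explicitly instead of inheriting it from LANG.
binmode(STDOUT, ':encoding(UTF-8)');

my $name = from_db("caf\xE9");   # 4 characters, whatever LANG says
print "$name\n";                 # written as UTF-8 by the layer above

my $payload = to_wire($name);    # 5 bytes: the e-acute takes two
```

The idea is that every conversion then happens at a point we chose, rather than wherever the environment happens to trigger one.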

Incidentally, we see a different conversion chain on Red Hat 7.2, which uses Perl 5.6.1 and Apache 1.3. The data seems to stay in Latin-1 until sometime within the ASP code, where it's converted to UTF-8. Very strange, but since the last conversion is the only one that matters to our code, it works out for the best in our case. Just FYI, for the mail archive diggers. :)

Thanks.
