Tom Mornini's post was interesting and possibly relevant. I'd certainly like to understand what's happening here. Might also tie in with Brigitte's recent utf8 issues: http://perlwelt.horus.at/Beispiele/Magic/PerlUnicodeMysql/

Try using the latest 5.8.1 development snapshot, which has the
most correct utf8 behaviour:

  http://www.iki.fi/jhi/[EMAIL PROTECTED]

(Report any problems with that to [EMAIL PROTECTED])

To help track utf8 issues I plan to make the neat() function (used
internally for the DBI trace) use double quotes around strings that
have the UTF8 flag set instead of the single quotes it normally uses.
(Also strings with the UTF8 flag set won't have some high-bit chars
displayed as dots.)

Tim.

On Tue, Jun 17, 2003 at 04:44:57PM -0700, Jonathan Leffler wrote:
> Dear Tim (and Jay - there's new info here, Jay),
>
> Jay Konigsberg originally approached me with a problem whereby an o-umlaut
> character in some data was being transformed into two bytes with
> different codes. After paring his initial 800-line reproduction down to
> just 92 lines of code, I was able to remove DBD::Informix and replace it
> with DBD::NullP and demonstrate that the problem appeared there, too, and
> the problem seems to be in the DBI code itself. However, it is not
> completely trivial; the reproduction still requires (seems to require)
> XML::Parser::PerlSAX to have handled the data first. Simply sucking the
> data in from a file and then passing it through DBI does not seem to
> trigger this reaction. The string passed as a parameter to $sth->execute()
> prints the unmodified value both before and after $sth->execute(), which
> really has me puzzled. And it is not just o-umlaut that gets mapped; other
> characters such as a-acute, a-grave, e-acute, e-grave, A-acute, A-grave,
> E-acute, E-grave and y-umlaut also get trampled similarly. I've diagnosed
> that the problem is in DBI because when run with PERL_DBI_DEBUG=2, the
> entry for '-> execute for DBD::NullP::st (...)' shows the modified string
> -- the transformation is certainly happening before DBD::NullP gets to see
> it (and before DBD::Informix sees it either).
>
> Jay is using Perl 5.8.0 on AIX 4.3.3 compiled with GCC 2.7.x; I'm using
> Perl 5.8.0 compiled on Solaris 7 with GCC 3.1 but now running on Solaris 8
> using GCC 3.3. Jay is using DBI 1.32; I am using DBI 1.37. I had to force
> install libxml-perl 0.07 this morning because one test failed. I am up to
> date within a day or so on almost all the modules I have installed - I did
> an update with CPANPLUS this morning (DBD::ODBC and DBD::Multiplex are out
> of date, though CPANPLUS says I've got D::M 0.90 installed and need to
> install D::M 0.90, which has me confused).
>
> Here's the test script - I'm not sure how much more it can be compressed.
> It needs the file jknullp.xml, which contains all the accented characters I
> mentioned.
>
> Is there a possibility that the XML stuff is somehow setting up the Perl
> Unicode system so that the Unicode thinks the characters should be recoded
> from ISO 8859-1 (as explicitly stated in the XML file) and is UTF-8
> encoding them? Let's see: the input character codes are:
>
> Name       8859-1   DBI trace      UTF-8
> o-umlaut   0xF6     0xC3 0xB6      0xC3 0xB6
> a-grave    0xE0     0xC3 0xA0      0xC3 0xA0
> a-acute    0xE1     0xC3 0xA1      0xC3 0xA1
> A-grave    0xC0     0xC3 0x2E *    0xC3 0x80
> A-acute    0xC1     0xC3 0x2E *    0xC3 0x81
> E-grave    0xC8     0xC3 0x2E *    0xC3 0x88
> E-acute    0xC9     0xC3 0x89      0xC3 0x89
> e-grave    0xE8     0xC3 0xA8      0xC3 0xA8
> e-acute    0xE9     0xC3 0xA9      0xC3 0xA9
> y-umlaut   0xFF     0xC3 0xBF      0xC3 0xBF
>
> Except for the three starred characters, the DBI trace is showing a valid
> mapping from ISO 8859-1 to UTF-8. The three starred characters are invalid
> UTF-8 sequences; the second byte should start with bits 10 to be valid.
>
> Any ideas on how to prevent this transformation from occurring? Is
> reversion to Perl 5.6.1 the answer? (Ugh if it is). Or will 5.8.1 fix
> this? Or is it something that should not be fixed? But then how does a
> person parsing XML deal with this? Or is it a property of the particular
> XML parser that Jay is using?
>
> HELP!!!
>
> (See attached file: jknullp.tgz)
>
> The tar file contains jknullp.pl (the Perl script), jknullp.trace (the
> output from running jknullp.pl on Solaris 8), and jknullp.xml (the XML
> source with accented characters in ISO 8859-1, as noted in the XML encoding
> information). They all unpack into the current directory.
>
> --
> Jonathan Leffler ([EMAIL PROTECTED])
> STSM, Informix Database Engineering, IBM Data Management
> 4100 Bohannon Drive, Menlo Park, CA 94025
> Tel: +1 650-926-6921 Tie-Line: 630-6921
> "I don't suffer from insanity; I enjoy every minute of it!"
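
For anyone who wants to check the byte table in the quoted message by hand, here is a minimal sketch in Python (not Perl, and not anything taken from DBI or the attached script). It re-derives the UTF-8 column from the 8859-1 column and flags the rows where the trace bytes differ. The names ROWS, latin1_to_utf8, is_continuation and mismatches are made up for this illustration. The three flagged rows all carry 0x2E, which is an ASCII dot; since the trace is said to display some high-bit chars as dots, that byte may be a display substitution rather than data, but that is a guess and nothing in the thread confirms it.

```python
# Rows copied from the table in the quoted message:
# (name, ISO 8859-1 byte, bytes reported in the DBI trace)
ROWS = [
    ("o-umlaut", 0xF6, (0xC3, 0xB6)),
    ("a-grave",  0xE0, (0xC3, 0xA0)),
    ("a-acute",  0xE1, (0xC3, 0xA1)),
    ("A-grave",  0xC0, (0xC3, 0x2E)),
    ("A-acute",  0xC1, (0xC3, 0x2E)),
    ("E-grave",  0xC8, (0xC3, 0x2E)),
    ("E-acute",  0xC9, (0xC3, 0x89)),
    ("e-grave",  0xE8, (0xC3, 0xA8)),
    ("e-acute",  0xE9, (0xC3, 0xA9)),
    ("y-umlaut", 0xFF, (0xC3, 0xBF)),
]


def latin1_to_utf8(byte):
    """Decode one ISO 8859-1 byte and re-encode the character as UTF-8."""
    return bytes([byte]).decode("latin-1").encode("utf-8")


def is_continuation(byte):
    """A UTF-8 continuation byte has the bit pattern 10xxxxxx."""
    return byte & 0xC0 == 0x80


def fmt(data):
    """Format bytes the way the table does, e.g. '0xC3 0xB6'."""
    return " ".join("0x%02X" % b for b in data)


def mismatches(rows=ROWS):
    """Names of rows whose trace bytes differ from the real UTF-8 encoding."""
    return [name for name, byte, trace in rows
            if bytes(trace) != latin1_to_utf8(byte)]


if __name__ == "__main__":
    for name, byte, trace in ROWS:
        utf8 = latin1_to_utf8(byte)
        star = "" if bytes(trace) == utf8 else " *"
        print("%-9s 0x%02X  %s  %s%s" % (name, byte, fmt(trace), fmt(utf8), star))
```

The check relies only on the fact that every ISO 8859-1 byte from 0xC0 to 0xFF encodes in UTF-8 as 0xC3 followed by a continuation byte in the range 0x80 to 0xBF; 0x2E fails the 10xxxxxx test, which is why those rows cannot be valid UTF-8 as shown.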