I'm glad you've found something that works. But can someone summarise the causes/issues into something we can all understand? [I don't have time to try to do that for myself.]
Obviously the current situation is not good. But I need to have a better understanding of *exactly* what's going on at all levels (app code and driver) before I can think much about what needs fixing or documenting, and how.

Tim.

On Wed, Jun 18, 2003 at 01:20:45PM -0700, Jonathan Leffler wrote:
> Dear Tom, Tim, Jay,
>
> Sorry for the top posting - Lotus Notes makes anything else impossibly
> cumbersome.
>
> Thanks for the information, and I'm sorry you ran into the same problem. I
> tried your pack/unpack trick with Perl 5.8.0 and that does not seem to work
> -- it converted everything into UTF-8 somewhere along the line, which also
> was the behaviour I got in Perl 5.6.1 when I tried the original example
> code I sent.
>
> I tried a couple of other tricks, including 'use bytes;' -- don't; it bites
> :-)
>
> What seems to work for me -- at least, the notation is perfectly explicit
> -- is:
>
>     use Encode;
>     ...
>     $action_note = encode("iso-8859-1", $action_note);
>     $sth->execute($action_note);
>
> There's a nice CAVEAT in the documentation: 'When you run "$octets =
> encode("utf8", $string)", then $octets may not be equal to $string. Though
> they both contain the same data, the utf8 flag for $octets is always off.
> When you encode anything, the utf8 flag of the result is always off, even
> when it contains completely valid utf8 string.'
>
> In the example code sent y'day, I had to place the "use Encode;" after the
> "package MyHandler;" or explicitly qualify the call with
> Encode::encode(...). I fear that this would be necessary on all strings
> manipulated by XML if they contain non-ASCII characters.
> I suspect that the use of map would help with variables in bulk:
>
>     @array = map { encode("iso-8859-1", $_) } @array;
>     ($var1, $var2, $var3) = map { encode("iso-8859-1", $_) } $var1, $var2, $var3;
>
> The sequence of manual bashing that led to the plausibly correct solution
> was:
>
>     perldoc perlunicode
>     perldoc encoding
>     perldoc Encode
>
> I note in passing that the 3rd Edition of the Camel book was not a great
> deal of help - it describes Perl 5.6.0 or thereabouts, and Unicode handling
> has changed in some respects since then.
>
> Beware Unicode - thank goodness for Encode!
>
> --
> Jonathan Leffler ([EMAIL PROTECTED])
> STSM, Informix Database Engineering, IBM Data Management
> 4100 Bohannon Drive, Menlo Park, CA 94025
> Tel: +1 650-926-6921   Tie-Line: 630-6921
> "I don't suffer from insanity; I enjoy every minute of it!"
>
> Tom Mornini <[EMAIL PROTECTED]ia.com>
> 06/17/2003 08:43 PM
>
> To:      Jonathan Leffler/Menlo Park/[EMAIL PROTECTED]
> cc:      [EMAIL PROTECTED], Tim Bunce <[EMAIL PROTECTED]>, Jay Konigsberg <[EMAIL PROTECTED]>
> Subject: Re: DBI (1.32, 1.37) transforms data before passing it to the
>          driver? (XML and UTF-8 getting in the way?)
>
> I *just* tracked down a similar problem in our system.
>
> It turns out that we were receiving data from SOAP::Lite that Perl had
> marked internally as UTF-8, even though there were NO characters that
> required it.
>
> Another value that was completely UTF-8 clear was being *magically*
> transformed by a Perl join operation, causing a single \xa3 to become
> the two bytes \xc2\xa3 when Perl internally upgraded the single byte
> string into UTF-8 for joining with the UTF-8 string.
>
> Apparently this operation caused an ISO Latin-1 to UTF-8 conversion to
> take place. :-)
>
> Aren't transitions to multibyte encodings wonderful?
>
> Our fix was to downgrade the UTF-8 string before the join.
>
> The simplest way I found to do this was:
>
>     $string = pack('H*',unpack('H*',$string));
>
> Devel::Peek was my *friend* on this one!
>
> We're using 5.6.1 so downgrading will not help this situation. In fact,
> it's not really much of a situation at all, as everything is working
> 'correctly'.
>
> On Tuesday, June 17, 2003, at 04:44 PM, Jonathan Leffler wrote:
>
> > Dear Tim (and Jay - there's new info here, Jay),
> >
> > Jay Konigsberg originally approached me with a problem whereby an
> > o-umlaut character in some data was being transformed into two bytes
> > with different codes. After paring his initial 800-line reproduction
> > down to just 92 lines of code, I was able to remove DBD::Informix and
> > replace it with DBD::NullP and demonstrate that the problem appeared
> > there, too, and the problem seems to be in the DBI code itself.
> > However, it is not completely trivial; the reproduction still requires
> > (seems to require) XML::Parser::PerlSAX to have handled the data first.
> > Simply sucking the data in from a file and then passing it through DBI
> > does not seem to trigger this reaction. The string passed as a
> > parameter to $sth->execute() prints the unmodified value both before
> > and after $sth->execute(), which really has me puzzled. And it is not
> > just o-umlaut that gets mapped; other characters such as a-acute,
> > a-grave, e-acute, e-grave, A-acute, A-grave, E-acute, E-grave and
> > y-umlaut also get trampled similarly.
> > I've diagnosed that the problem is in DBI because when run with
> > PERL_DBI_DEBUG=2, the entry for '-> execute for DBD::NullP::st (...)'
> > shows the modified string -- the transformation is certainly happening
> > before DBD::NullP gets to see it (and before DBD::Informix sees it
> > either).
> >
> > Jay is using Perl 5.8.0 on AIX 4.3.3 compiled with GCC 2.7.x; I'm using
> > Perl 5.8.0 compiled on Solaris 7 with GCC 3.1 but now running on
> > Solaris 8 using GCC 3.3. Jay is using DBI 1.32; I am using DBI 1.37.
> > I had to force install libxml-perl 0.07 this morning because one test
> > failed. I am up to date within a day or so on almost all the modules
> > I have installed - I did an update with CPANPLUS this morning
> > (DBD::ODBC and DBD::Multiplex are out of date, though CPANPLUS says
> > I've got D::M 0.90 installed and need to install D::M 0.90, which has
> > me confused).
> >
> > Here's the test script - I'm not sure how much more it can be
> > compressed. It needs the file jknullp.xml, which contains all the
> > accented characters I mentioned.
> >
> > Is there a possibility that the XML stuff is somehow setting up the
> > Perl Unicode system so that the Unicode thinks the characters should be
> > recoded from ISO 8859-1 (as explicitly stated in the XML file) and is
> > UTF-8 encoding them? Let's see: the input character codes are:
> >
> >     Name       8859-1   DBI trace      UTF-8
> >     o-umlaut   0xF6     0xC3 0xB6      0xC3 0xB6
> >     a-grave    0xE0     0xC3 0xA0      0xC3 0xA0
> >     a-acute    0xE1     0xC3 0xA1      0xC3 0xA1
> >     A-grave    0xC0     0xC3 0x2E *    0xC3 0x80
> >     A-acute    0xC1     0xC3 0x2E *    0xC3 0x81
> >     E-grave    0xC8     0xC3 0x2E *    0xC3 0x88
> >     E-acute    0xC9     0xC3 0x89      0xC3 0x89
> >     e-grave    0xE8     0xC3 0xA8      0xC3 0xA8
> >     e-acute    0xE9     0xC3 0xA9      0xC3 0xA9
> >     y-umlaut   0xFF     0xC3 0xBF      0xC3 0xBF
> >
> > Except for the three starred characters, the DBI trace is showing a
> > valid mapping from ISO 8859-1 to UTF-8.
> > The three starred characters are invalid UTF-8 sequences; the second
> > byte should start with bits 10 to be valid.
> >
> > Any ideas on how to prevent this transformation from occurring? Is
> > reversion to Perl 5.6.1 the answer? (Ugh if it is). Or will 5.8.1 fix
> > this? Or is it something that should not be fixed? But then how does
> > a person parsing XML deal with this? Or is it a property of the
> > particular XML parser that Jay is using?
> >
> > HELP!!!
> >
> > (See attached file: jknullp.tgz)
> >
> > The tar file contains jknullp.pl (the Perl script), jknullp.trace (the
> > output from running jknullp.pl on Solaris 8), and jknullp.xml (the XML
> > source with accented characters in ISO 8859-1, as noted in the XML
> > encoding information). They all unpack into the current directory.
> >
> > --
> > Jonathan Leffler ([EMAIL PROTECTED])
> > STSM, Informix Database Engineering, IBM Data Management
> > 4100 Bohannon Drive, Menlo Park, CA 94025
> > Tel: +1 650-926-6921   Tie-Line: 630-6921
> > "I don't suffer from insanity; I enjoy every minute of it!"
> >
> > <jknullp.tgz>
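
The upgrade-on-join behaviour Tom describes, and the explicit encode() call Jonathan settled on, can probably be reproduced without DBI or an XML parser. What follows is only a minimal sketch, assuming Perl 5.8 or later. utf8::upgrade() is a stand-in for whatever the parser does to switch on the internal UTF-8 flag, and the variable names are illustrative; none of this is taken from jknullp.pl.

```perl
use strict;
use warnings;
use Encode qw(encode);

# One ISO 8859-1 byte (pound sign), held as a plain byte string.
my $latin1 = "\xa3";

# Pure ASCII text with Perl's internal UTF-8 flag switched on.
# Assumption: this mimics what a parser may hand back; utf8::upgrade()
# is only a stand-in for that behaviour.
my $flagged = "abc";
utf8::upgrade($flagged);

# Joining a flagged string with an unflagged one upgrades the result.
my $joined = join("", $flagged, $latin1);

# Copy the string and turn it into its UTF-8 octets, which is what code
# reading the raw buffer without checking the flag would see.
my $raw = $joined;
utf8::encode($raw);

# Explicitly encoding to ISO 8859-1 yields single-byte data, flag off.
my $octets = encode("iso-8859-1", $joined);

printf "joined: %d chars, UTF-8 flag %s\n",
    length($joined), utf8::is_utf8($joined) ? "on" : "off";
printf "raw buffer: %s\n",
    join(" ", map { sprintf("0x%02X", ord($_)) } split(//, $raw));
printf "encoded:    %s\n",
    join(" ", map { sprintf("0x%02X", ord($_)) } split(//, $octets));
```

In this sketch the raw buffer ends in 0xC2 0xA3 while the encoded string ends in the single byte 0xA3, which matches the Latin-1 to UTF-8 pattern in the table above. Whether DBI or a given driver reads the buffer this way is exactly the open question in the thread, so treat the "raw buffer" line as an illustration, not a diagnosis.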