Re: DBI (1.32, 1.37) transforms data before passing it to the driver? (XML and UTF-8 getting in the way?)

Tim Bunce Thu, 19 Jun 2003 02:00:55 -0700

I'm glad you've found something that works.

But can someone summarise the causes/issues into something we can
all understand? [I don't have time to try to do that for myself.]


Obviously the current situation is not good. But I need to have a
better understanding of *exactly* what's going on at all levels
(app code and driver) before I can think much about what needs
fixing or documenting, and how.

Tim.


On Wed, Jun 18, 2003 at 01:20:45PM -0700, Jonathan Leffler wrote:
> 
> 
> 
> 
> Dear Tom, Tim, Jay,
> 
> Sorry for the top posting - Lotus Notes makes anything else impossibly
> cumbersome.
> 
> Thanks for the information, and I'm sorry you ran into the same problem.  I
> tried your pack/unpack trick with Perl 5.8.0 and that does not seem to work
> -- it converted everything into UTF-8 somewhere along the line, which also
> was the behaviour I got in Perl 5.6.1 when I tried the original example
> code I sent.
> 
> I tried a couple of other tricks, including 'use bytes;' -- don't; it bites
> :-)
> 
> What seems to work for me -- at least, the notation is perfectly explicit
> -- is:
> 
>       use Encode;
>       ...
>       $action_note = encode("iso-8859-1", $action_note);
>       $sth->execute($action_note);
> 
> There's a nice CAVEAT in the documentation: 'When you run "$octets =
> encode("utf8", $string)", then $octets may not be equal to $string.  Though
> they both contain the same data, the utf8 flag for $octets is always off.
> When you encode anything, the utf8 flag of the result is always off, even
> when it contains completely valid utf8 string.'
> 
> In the example code sent y'day, I had to place the "use Encode;" after the
> "package MyHandler;" or explicitly qualify the call with
> Encode::encode(...).  I fear that this would be necessary on all strings
> manipulated by XML if they contain non-ASCII characters.  I suspect that
> the use of map would help with variables in bulk:
> 
>       @array  = map { encode("iso-8859-1", $_) } @array;
>       ($var1, $var2, $var3) =
>           map { encode("iso-8859-1", $_) } $var1, $var2, $var3;
> 
> The sequence of manual bashing that led to the plausibly correct solution
> was:
> 
>       perldoc perlunicode
>       perldoc encoding
>       perldoc Encode
> 
> I note in passing that the 3rd Edition of the Camel book was not a great
> deal of help - it describes Perl 5.6.0 or thereabouts, and Unicode handling
> has changed in some respects since then.
> 
> Beware Unicode - thank goodness for Encode!
> 
> --
> Jonathan Leffler ([EMAIL PROTECTED])
> STSM, Informix Database Engineering, IBM Data Management
> 4100 Bohannon Drive, Menlo Park, CA 94025
> Tel: +1 650-926-6921   Tie-Line: 630-6921
>       "I don't suffer from insanity; I enjoy every minute of it!"
> 
> 
> 
> 
> Tom Mornini <[EMAIL PROTECTED]ia.com>
> 06/17/2003 08:43 PM
>
>       To:       Jonathan Leffler/Menlo Park/[EMAIL PROTECTED]
>       cc:       [EMAIL PROTECTED], Tim Bunce <[EMAIL PROTECTED]>, Jay Konigsberg <[EMAIL PROTECTED]>
>       Subject:  Re: DBI (1.32, 1.37) transforms data before passing it to the driver?  (XML and UTF-8 getting in the way?)
> 
> 
> 
> 
> I *just* tracked down a similar problem in our system.
> 
> It turns out that we were receiving data from SOAP::Lite that Perl had
> marked internally as UTF-8, even though there were NO characters that
> required it.
> 
> Another value that was completely UTF-8 clear was being *magically*
> transformed by a Perl join operation, causing a single \xa3 to become
> the two bytes \xc2\xa3 when Perl internally upgraded the single-byte
> string to UTF-8 for joining with the UTF-8 string.
> 
> Apparently this operation caused an ISO Latin-1 to UTF-8 conversion to
> take place. :-)
> 
> Aren't transitions to multibyte encodings wonderful?
> 
> Our fix was to downgrade the UTF-8 string before the join.
> 
> The simplest way I found to do this was:
> 
> $string = pack('H*',unpack('H*',$string));
> 
> Devel::Peek was my *friend* on this one!
> 
> We're using 5.6.1 so downgrading will not help this situation. In fact,
> it's not really much of a situation at all, as everything is working
> 'correctly'.
> 
> On Tuesday, June 17, 2003, at 04:44 PM, Jonathan Leffler wrote:
> 
> > Dear Tim (and Jay - there's new info here, Jay),
> >
> > Jay Konigsberg originally approached me with a problem whereby an o-umlaut
> > character in some data was being transformed into two bytes with
> > different codes.  After paring his initial 800-line reproduction down to
> > just 92 lines of code, I was able to remove DBD::Informix and replace it
> > with DBD::NullP and demonstrate that the problem appeared there, too, and
> > the problem seems to be in the DBI code itself.  However, it is not
> > completely trivial; the reproduction still requires (seems to require)
> > XML::Parser::PerlSAX to have handled the data first.  Simply sucking the
> > data in from a file and then passing it through DBI does not seem to
> > trigger this reaction.  The string passed as a parameter to $sth->execute()
> > prints the unmodified value both before and after $sth->execute(), which
> > really has me puzzled.  And it is not just o-umlaut that gets mapped; other
> > characters such as a-acute, a-grave, e-acute, e-grave, A-acute, A-grave,
> > E-acute, E-grave and y-umlaut also get trampled similarly.  I've diagnosed
> > that the problem is in DBI because when run with PERL_DBI_DEBUG=2, the
> > entry for '-> execute for DBD::NullP::st (...)' shows the modified string
> > -- the transformation is certainly happening before DBD::NullP gets to see
> > it (and before DBD::Informix sees it either).
> >
> > Jay is using Perl 5.8.0 on AIX 4.3.3 compiled with GCC 2.7.x; I'm using
> > Perl 5.8.0 compiled on Solaris 7 with GCC 3.1 but now running on Solaris 8
> > using GCC 3.3.  Jay is using DBI 1.32; I am using DBI 1.37.  I had to force
> > install libxml-perl 0.07 this morning because one test failed.  I am up to
> > date within a day or so on almost all the modules I have installed - I did
> > an update with CPANPLUS this morning (DBD::ODBC and DBD::Multiplex are out
> > of date, though CPANPLUS says I've got D::M 0.90 installed and need to
> > install D::M 0.90, which has me confused).
> >
> > Here's the test script - I'm not sure how much more it can be compressed.
> > It needs the file jknullp.xml, which contains all the accented characters I
> > mentioned.
> >
> > Is there a possibility that the XML stuff is somehow setting up the Perl
> > Unicode system so that it thinks the characters should be recoded from
> > ISO 8859-1 (as explicitly stated in the XML file) and is UTF-8 encoding
> > them?  Let's see: the input character codes are:
> >
> > Name        8859-1      DBI trace         UTF-8
> > o-umlaut    0xF6        0xC3 0xB6         0xC3 0xB6
> > a-grave     0xE0        0xC3 0xA0         0xC3 0xA0
> > a-acute     0xE1        0xC3 0xA1         0xC3 0xA1
> > A-grave     0xC0        0xC3 0x2E *       0xC3 0x80
> > A-acute     0xC1        0xC3 0x2E *       0xC3 0x81
> > E-grave     0xC8        0xC3 0x2E *       0xC3 0x88
> > E-acute     0xC9        0xC3 0x89         0xC3 0x89
> > e-grave     0xE8        0xC3 0xA8         0xC3 0xA8
> > e-acute     0xE9        0xC3 0xA9         0xC3 0xA9
> > y-umlaut    0xFF        0xC3 0xBF         0xC3 0xBF
> >
> > Except for the three starred characters, the DBI trace is showing a valid
> > mapping from ISO 8859-1 to UTF-8.  The three starred characters are invalid
> > UTF-8 sequences; the second byte should start with bits 10 to be valid.
> >
> > Any ideas on how to prevent this transformation from occurring?  Is
> > reversion to Perl 5.6.1 the answer?  (Ugh if it is).  Or will 5.8.1 fix
> > this?  Or is it something that should not be fixed?  But then how does a
> > person parsing XML deal with this?  Or is it a property of the particular
> > XML parser that Jay is using?
> >
> > HELP!!!
> >
> > (See attached file: jknullp.tgz)
> >
> > The tar file contains jknullp.pl (the Perl script), jknullp.trace (the
> > output from running jknullp.pl on Solaris 8), and jknullp.xml (the XML
> > source with accented characters in ISO 8859-1, as noted in the XML encoding
> > information).  They all unpack into the current directory.
> >
> > --
> > Jonathan Leffler ([EMAIL PROTECTED])
> > STSM, Informix Database Engineering, IBM Data Management
> > 4100 Bohannon Drive, Menlo Park, CA 94025
> > Tel: +1 650-926-6921   Tie-Line: 630-6921
> >       "I don't suffer from insanity; I enjoy every minute of it!"
> > <jknullp.tgz>
> 
> 
> 
>
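
The upgrade Tom describes and the Encode fix Jonathan settled on can
probably be reproduced without DBI or an XML parser.  What follows is a
minimal sketch, assuming Perl 5.8 or later.  The variable names are made
up for illustration, and utf8::upgrade() is only a stand-in for whatever
SOAP::Lite or XML::Parser::PerlSAX does to set the internal UTF-8 flag.
The copy passed through utf8::encode() approximates what a layer reading
Perl's internal buffer would see; it is not taken from the DBI source.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Byte string holding one ISO 8859-1 character (pound sign), flag off.
my $latin1 = "\xa3";

# ASCII-only string with the internal UTF-8 flag forced on (an
# illustrative stand-in for data returned by an XML or SOAP layer).
my $flagged = "price: ";
utf8::upgrade($flagged);

# Joining upgrades $latin1, so the result carries the flag as well.
my $joined = join("", $flagged, $latin1);

# Approximate the internal buffer: the UTF-8 encoding of the characters.
my $raw = $joined;
utf8::encode($raw);

# Jonathan's fix: explicit ISO 8859-1 octets, UTF-8 flag always off.
my $octets = encode("iso-8859-1", $joined);

printf "joined flagged: %s\n", utf8::is_utf8($joined) ? "yes" : "no";
printf "internal bytes: %s\n", unpack("H*", $raw);
printf "octets flagged: %s\n", utf8::is_utf8($octets) ? "yes" : "no";
printf "encoded bytes:  %s\n", unpack("H*", $octets);
```

The "internal bytes" line should end in c2a3 and the "encoded bytes" line
in a3, which is the same two-byte versus one-byte difference as in the
table of character codes above.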

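Jonathan's suggestion of using map to convert variables in bulk could be
wrapped in a small helper that is applied to the bind values just before
$sth->execute().  This is a sketch only: latin1_octets() is a hypothetical
name, not part of DBI or Encode, and it assumes the database expects
ISO 8859-1.  Characters that do not exist in ISO 8859-1 are replaced by a
substitution character under Encode's default settings, so real code may
want stricter checking.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: returns ISO 8859-1 octet copies of its arguments.
# undef is passed through so that NULL bind values stay NULL.
sub latin1_octets {
    my @out;
    for my $value (@_) {
        push @out, defined $value ? encode("iso-8859-1", $value) : undef;
    }
    return @out;
}

# Sample data: the first value has the internal UTF-8 flag forced on.
my @values = ("caf\x{e9}", undef, "plain");
utf8::upgrade($values[0]);

my @bind = latin1_octets(@values);

# With a real statement handle this would be: $sth->execute(@bind);
for my $v (@bind) {
    print defined $v ? unpack("H*", $v) : "undef", "\n";
}
```

The helper returns copies and leaves the caller's variables alone, so the
flagged strings remain usable for any further character-level processing.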