RE: DBI (1.32, 1.37) transforms data before passing it to the driver? (XML and UTF-8 getting in the way?)

Konigsberg, Jay Thu, 19 Jun 2003 01:23:05 -0700

Jonathan,

Yes, 'encode' works.


I realize the problem isn't really fixed, but it's fixed enough for this humble (and 
humbled) Applications Programmer/DBA.

Thank you to everyone,
Jay

Jay Konigsberg
Database Administrator
TowerRecords.com
(916) 373-2406    Fax: (916) 373-2930
[EMAIL PROTECTED]

If something is worth doing it's worth doing correctly.

 -----Original Message-----
From:   Jonathan Leffler [mailto:[EMAIL PROTECTED] 
Sent:   Wednesday, June 18, 2003 1:21 PM
To:     Tom Mornini
Cc:     [EMAIL PROTECTED]; Konigsberg, Jay; Tim Bunce
Subject:        Re: DBI (1.32, 1.37) transforms data before passing it to the driver?  (XML and UTF-8 getting in the way?)





Dear Tom, Tim, Jay,

Sorry for the top posting - Lotus Notes makes anything else impossibly
cumbersome.

Thanks for the information, and I'm sorry you ran into the same problem.  I
tried your pack/unpack trick with Perl 5.8.0 and that does not seem to work
-- it converted everything into UTF-8 somewhere along the line, which also
was the behaviour I got in Perl 5.6.1 when I tried the original example
code I sent.

I tried a couple of other tricks, including 'use bytes;' -- don't; it bites
:-)

What seems to work for me -- at least, the notation is perfectly explicit
-- is:

      use Encode;
      ...
      $action_note = encode("iso-8859-1", $action_note);
      $sth->execute($action_note);

There's a nice CAVEAT in the documentation: 'When you run "$octets =
encode("utf8", $string)", then $octets may not be equal to $string.  Though
they both contain the same data, the utf8 flag for $octets is always off.
When you encode anything, the utf8 flag of the result is always off, even
when it contains completely valid utf8 string.'
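
A minimal sketch of that caveat, assuming Perl 5.8.1 or later (for
utf8::upgrade and utf8::is_utf8); the sample string is made up for
illustration and is not from Jay's data:

```perl
use strict;
use warnings;
use Encode qw(encode);

# A character string holding o-umlaut (U+00F6), forced into Perl's
# internal UTF-8 representation, roughly as an XML parser might return it.
my $string = "Sch\x{F6}n";
utf8::upgrade($string);

# Encode to ISO 8859-1 octets: one byte per character, utf8 flag off.
my $octets = encode("iso-8859-1", $string);

printf "string: utf8 flag %s, %d characters\n",
    (utf8::is_utf8($string) ? "on" : "off"), length($string);
printf "octets: utf8 flag %s, %d bytes, hex %s\n",
    (utf8::is_utf8($octets) ? "on" : "off"), length($octets),
    unpack("H*", $octets);
```

The octets are what a driver that ignores the utf8 flag should receive
if the database column is expected to hold ISO 8859-1 data.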

In the example code sent y'day, I had to place the "use Encode;" after the
"package MyHandler;" or explicitly qualify the call with
Encode::encode(...).  I fear that this would be necessary on all strings
manipulated by XML if they contain non-ASCII characters.  I suspect that
the use of map would help with variables in bulk:

      @array  = map { encode("iso-8859-1", $_) } @array;
      ($var1, $var2, $var3) = map { encode("iso-8859-1", $_) } $var1, $var2, $var3;
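
For bulk use, the map idiom could be wrapped in a small helper.  This is
only a sketch: latin1_octets is a hypothetical name, not something from
Encode or DBI, and the sample values are invented.  It passes FB_CROAK
so that a character with no ISO 8859-1 equivalent raises an error
rather than being replaced by encode's default substitution character,
and it encodes a copy because encode() may modify its input when a
CHECK value is given:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper (not from the thread): return ISO 8859-1 octets
# for every argument, dying on characters Latin-1 cannot represent.
sub latin1_octets {
    return map {
        my $copy = $_;    # encode() with a CHECK value may modify its input
        encode("iso-8859-1", $copy, Encode::FB_CROAK());
    } @_;
}

# Invented sample values, flagged as UTF-8 the way parsed XML might be.
my ($var1, $var2, $var3) = ("caf\x{E9}", "na\x{EF}ve", "plain");
utf8::upgrade($_) for $var1, $var2, $var3;

($var1, $var2, $var3) = latin1_octets($var1, $var2, $var3);
print join(",", map { unpack("H*", $_) } $var1, $var2, $var3), "\n";
```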

The sequence of manual bashing that led to the plausibly correct solution
was:

      perldoc perlunicode
      perldoc encoding
      perldoc Encode

I note in passing that the 3rd Edition of the Camel book was not a great
deal of help - it describes Perl 5.6.0 or thereabouts, and Unicode handling
has changed in some respects since then.

Beware Unicode - thank goodness for Encode!

--
Jonathan Leffler ([EMAIL PROTECTED])
STSM, Informix Database Engineering, IBM Data Management
4100 Bohannon Drive, Menlo Park, CA 94025
Tel: +1 650-926-6921   Tie-Line: 630-6921
      "I don't suffer from insanity; I enjoy every minute of it!"




From:     Tom Mornini <[EMAIL PROTECTED]ia.com>
Sent:     06/17/2003 08:43 PM
To:       Jonathan Leffler/Menlo Park/[EMAIL PROTECTED]
cc:       [EMAIL PROTECTED], Tim Bunce <[EMAIL PROTECTED]>, Jay Konigsberg <[EMAIL PROTECTED]>
Subject:  Re: DBI (1.32, 1.37) transforms data before passing it to the driver?  (XML and UTF-8 getting in the way?)




I *just* tracked down a similar problem in our system.

It turns out that we were receiving data from SOAP::Lite that Perl had
marked internally as UTF-8, even though there were NO characters that
required it.

Another value that was completely UTF-8 clear was being *magically*
transformed by a Perl join operation, causing a single \xa3 to become
\xc2\xa3 when Perl internally upgraded the single-byte string to
UTF-8 for joining with the UTF-8 string.

Apparently this operation caused an ISO Latin-1 to UTF-8 conversion to
take place. :-)

Aren't transitions to multibyte encodings wonderful?

Our fix was to downgrade the UTF-8 string before the join.

The simplest way I found to do this was:

$string = pack('H*',unpack('H*',$string));

Devel::Peek was my *friend* on this one!

We're using 5.6.1 so downgrading will not help this situation. In fact,
it's not really much of a situation at all, as everything is working
'correctly'.
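
The join behaviour described above can be sketched as follows.  This
assumes Perl 5.8.1 or later, where utf8::upgrade, utf8::downgrade,
utf8::encode and utf8::is_utf8 are built in (so it is not the 5.6.1
pack/unpack approach), and the sample strings are invented:

```perl
use strict;
use warnings;

my $pound   = "\xa3";       # one byte, utf8 flag off
my $flagged = "price";      # ASCII only ...
utf8::upgrade($flagged);    # ... yet flagged as UTF-8, like the SOAP::Lite data

# Joining a flagged string with a byte string upgrades the result.
my $joined = join(" ", $flagged, $pound);
print "flag after join: ", (utf8::is_utf8($joined) ? "on" : "off"), "\n";

# Expose the internal UTF-8 bytes of the joined string as plain octets.
my $internal = $joined;
utf8::encode($internal);
print "internal bytes:  ", unpack("H*", $internal), "\n";

# Downgrade the flagged string before the join, as in the fix above.
utf8::downgrade($flagged);
my $fixed = join(" ", $flagged, $pound);
print "flag after fix:  ", (utf8::is_utf8($fixed) ? "on" : "off"), "\n";
print "bytes after fix: ", unpack("H*", $fixed), "\n";
```

As a character string $joined is still correct; the C2 A3 pair only
becomes visible to code that reads the internal buffer without checking
the utf8 flag.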

On Tuesday, June 17, 2003, at 04:44 PM, Jonathan Leffler wrote:

> Dear Tim (and Jay - there's new info here, Jay),
>
> Jay Konigsberg originally approached me with a problem whereby an
> o-umlaut character in some data was being transformed into two bytes
> with different codes.  After paring his initial 800-line reproduction
> down to just 92 lines of code, I was able to remove DBD::Informix and
> replace it with DBD::NullP and demonstrate that the problem appeared
> there, too, and the problem seems to be in the DBI code itself.
> However, it is not completely trivial; the reproduction still requires
> (seems to require) XML::Parser::PerlSAX to have handled the data first.
> Simply sucking the data in from a file and then passing it through DBI
> does not seem to trigger this reaction.  The string passed as a
> parameter to $sth->execute() prints the unmodified value both before
> and after $sth->execute(), which really has me puzzled.  And it is not
> just o-umlaut that gets mapped; other characters such as a-acute,
> a-grave, e-acute, e-grave, A-acute, A-grave, E-acute, E-grave and
> y-umlaut also get trampled similarly.  I've diagnosed that the problem
> is in DBI because when run with PERL_DBI_DEBUG=2, the entry for
> '-> execute for DBD::NullP::st (...)' shows the modified string -- the
> transformation is certainly happening before DBD::NullP gets to see it
> (and before DBD::Informix sees it either).
>
> Jay is using Perl 5.8.0 on AIX 4.3.3 compiled with GCC 2.7.x; I'm using
> Perl 5.8.0 compiled on Solaris 7 with GCC 3.1 but now running on
> Solaris 8 using GCC 3.3.  Jay is using DBI 1.32; I am using DBI 1.37.
> I had to force install libxml-perl 0.07 this morning because one test
> failed.  I am up to date within a day or so on almost all the modules I
> have installed - I did an update with CPANPLUS this morning (DBD::ODBC
> and DBD::Multiplex are out of date, though CPANPLUS says I've got D::M
> 0.90 installed and need to install D::M 0.90, which has me confused).
>
> Here's the test script - I'm not sure how much more it can be
> compressed.  It needs the file jknullp.xml, which contains all the
> accented characters I mentioned.
>
> Is there a possibility that the XML stuff is somehow setting up the
> Perl Unicode system so that it thinks the characters should be recoded
> from ISO 8859-1 (as explicitly stated in the XML file) and is UTF-8
> encoding them?   Let's see: the input character codes are:
>
> Name        8859-1      DBI trace         UTF-8
> o-umlaut    0xF6        0xC3 0xB6         0xC3 0xB6
> a-grave     0xE0        0xC3 0xA0         0xC3 0xA0
> a-acute     0xE1        0xC3 0xA1         0xC3 0xA1
> A-grave     0xC0        0xC3 0x2E *       0xC3 0x80
> A-acute     0xC1        0xC3 0x2E *       0xC3 0x81
> E-grave     0xC8        0xC3 0x2E *       0xC3 0x88
> E-acute     0xC9        0xC3 0x89         0xC3 0x89
> e-grave     0xE8        0xC3 0xA8         0xC3 0xA8
> e-acute     0xE9        0xC3 0xA9         0xC3 0xA9
> y-umlaut    0xFF        0xC3 0xBF         0xC3 0xBF
>
> Except for the three starred characters, the DBI trace is showing a
> valid mapping from ISO 8859-1 to UTF-8.  The three starred characters
> are invalid UTF-8 sequences; the second byte should start with bits 10
> to be valid.
>
> Any ideas on how to prevent this transformation from occurring?  Is
> reversion to Perl 5.6.1 the answer?  (Ugh if it is).  Or will 5.8.1 fix
> this?  Or is it something that should not be fixed?  But then how does
> a person parsing XML deal with this?  Or is it a property of the
> particular XML parser that Jay is using?
>
> HELP!!!
>
> (See attached file: jknullp.tgz)
>
> The tar file contains jknullp.pl (the Perl script), jknullp.trace (the
> output from running jknullp.pl on Solaris 8), and jknullp.xml (the XML
> source with accented characters in ISO 8859-1, as noted in the XML
> encoding information).  They all unpack into the current directory.
>
> --
> Jonathan Leffler ([EMAIL PROTECTED])
> STSM, Informix Database Engineering, IBM Data Management
> 4100 Bohannon Drive, Menlo Park, CA 94025
> Tel: +1 650-926-6921   Tie-Line: 630-6921
>       "I don't suffer from insanity; I enjoy every minute of it!"
> <jknullp.tgz>
