I'm glad you've found something that works.
But can someone summarise the causes/issues into something we can
all understand? [I don't have time to try to do that for myself.]
Obviously the current situation is not good. But I need to have a
better understanding of *exactly* what's going on at all levels
(app code and driver) before I can think much about what needs
fixing or documenting, and how.
Tim.
On Wed, Jun 18, 2003 at 01:20:45PM -0700, Jonathan Leffler wrote:
>
> Dear Tom, Tim, Jay,
>
> Sorry for the top posting - Lotus Notes makes anything else impossibly
> cumbersome.
>
> Thanks for the information, and I'm sorry you ran into the same problem. I
> tried your pack/unpack trick with Perl 5.8.0 and that does not seem to work
> -- it converted everything into UTF-8 somewhere along the line, which also
> was the behaviour I got in Perl 5.6.1 when I tried the original example
> code I sent.
>
> I tried a couple of other tricks, including 'use bytes;' -- don't; it bites
> :-)
>
> What seems to work for me -- at least, the notation is perfectly explicit
> -- is:
>
> use Encode;
> ...
> $action_note = encode("iso-8859-1", $action_note);
> $sth->execute($action_note);
>
> There's a nice CAVEAT in the documentation: 'When you run "$octets =
> encode("utf8", $string)", then $octets may not be equal to $string. Though
> they both contain the same data, the utf8 flag for $octets is always off.
> When you encode anything, the utf8 flag of the result is always off, even
> when it contains completely valid utf8 string.'
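
A minimal sketch of that caveat, assuming Perl 5.8 or later with the core
Encode module. The variable names and the one-character input are
hypothetical stand-ins for whatever the XML parser hands back; this is not
the original reproduction script.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical stand-in for a string returned by an XML parser:
# one character, o-umlaut (U+00F6), held as a character string.
my $string = decode("UTF-8", "\xC3\xB6");

# Encode to ISO 8859-1 octets before handing the value to DBI.
my $octets = encode("iso-8859-1", $string);

# Report the utf8 flag on each side, plus the resulting octets in hex.
printf "input:  utf8 flag=%d length=%d\n",
    Encode::is_utf8($string) ? 1 : 0, length $string;
printf "output: utf8 flag=%d bytes=%s\n",
    Encode::is_utf8($octets) ? 1 : 0, unpack("H*", $octets);
```

The encoded value is what would then be passed to $sth->execute().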
>
> In the example code sent y'day, I had to place the "use Encode;" after the
> "package MyHandler;" or explicitly qualify the call with
> Encode::encode(...). I fear that this would be necessary on all strings
> manipulated by XML if they contain non-ASCII characters. I suspect that
> the use of map would help with variables in bulk:
>
> @array = map { encode("iso-8859-1", $_) } @array;
> ($var1, $var2, $var3) = map { encode("iso-8859-1", $_) } $var1, $var2,
> $var3;
>
> The sequence of manual bashing that led to the plausibly correct solution
> was:
>
> perldoc perlunicode
> perldoc encoding
> perldoc Encode
>
> I note in passing that the 3rd Edition of the Camel book was not a great
> deal of help - it describes Perl 5.6.0 or thereabouts, and Unicode handling
> has changed in some respects since then.
>
> Beware Unicode - thank goodness for Encode!
>
> --
> Jonathan Leffler ([EMAIL PROTECTED])
> STSM, Informix Database Engineering, IBM Data Management
> 4100 Bohannon Drive, Menlo Park, CA 94025
> Tel: +1 650-926-6921 Tie-Line: 630-6921
> "I don't suffer from insanity; I enjoy every minute of it!"
>
> From: Tom Mornini <[EMAIL PROTECTED]ia.com>
> Date: 06/17/2003 08:43 PM
> To: Jonathan Leffler/Menlo Park/[EMAIL PROTECTED]
> cc: [EMAIL PROTECTED], Tim Bunce <[EMAIL PROTECTED]>, Jay Konigsberg <[EMAIL PROTECTED]>
> Subject: Re: DBI (1.32, 1.37) transforms data before passing it to the driver? (XML and UTF-8 getting in the way?)
>
> I *just* tracked down a similar problem in our system.
>
> It turns out that we were receiving data from SOAP::Lite that Perl had
> marked internally as UTF-8, even though there were NO characters that
> required it.
>
> Another value that was completely UTF-8 clear was being *magically*
> transformed by a Perl join operation, causing a single \xa3 to become
> \xc2\xa3 when Perl internally upgraded the single byte string into
> UTF-8 for joining with the UTF-8 string.
>
> Apparently this operation caused an ISO Latin-1 to UTF-8 conversion to
> take place. :-)
>
> Aren't transitions to multibyte encodings wonderful?
>
> Our fix was to downgrade the UTF-8 string before the join.
>
> The simplest way I found to do this was:
>
> $string = pack('H*',unpack('H*',$string));
>
> Devel::Peek was my *friend* on this one!
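
The upgrade-on-join effect described above can be reproduced in a few
lines. This is only a sketch: it assumes Perl 5.8 or later, the variable
names are hypothetical, and utf8::upgrade() is used here to imitate the
flagged value that SOAP::Lite returned. It also uses utf8::downgrade()
rather than the pack/unpack trick.

```perl
use strict;
use warnings;
use Encode ();

# A plain byte string holding the single byte 0xA3.
my $bytes = "\xa3";

# Hypothetical stand-in for the SOAP::Lite value: pure ASCII content,
# but with Perl's internal UTF-8 flag switched on.
my $flagged = "abc";
utf8::upgrade($flagged);

# Joining the two yields a flagged string; written out as UTF-8,
# the 0xA3 byte has become the two bytes 0xC2 0xA3.
my $joined = join("", $bytes, $flagged);
printf "joined:     flag=%d utf8=%s\n",
    Encode::is_utf8($joined) ? 1 : 0,
    unpack("H*", Encode::encode("UTF-8", $joined));

# Downgrading the flagged string first keeps the join byte-for-byte.
utf8::downgrade($flagged);
my $clean = join("", $bytes, $flagged);
printf "downgraded: flag=%d bytes=%s\n",
    Encode::is_utf8($clean) ? 1 : 0,
    unpack("H*", $clean);
```

Devel::Peek's Dump() on the same scalars shows the flag and the internal
buffer directly.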
>
> We're using 5.6.1 so downgrading will not help this situation. In fact,
> it's not really much of a situation at all, as everything is working
> 'correctly'.
>
> On Tuesday, June 17, 2003, at 04:44 PM, Jonathan Leffler wrote:
>
> > Dear Tim (and Jay - there's new info here, Jay),
> >
> > Jay Konigsberg originally approached me with a problem whereby an
> > o-umlaut character in some data was being transformed into two bytes
> > with different codes. After paring his initial 800-line reproduction
> > down to just 92 lines of code, I was able to remove DBD::Informix and
> > replace it with DBD::NullP and demonstrate that the problem appeared
> > there, too, and the problem seems to be in the DBI code itself.
> > However, it is not completely trivial; the reproduction still requires
> > (seems to require) XML::Parser::PerlSAX to have handled the data
> > first. Simply sucking the data in from a file and then passing it
> > through DBI does not seem to trigger this reaction. The string passed
> > as a parameter to $sth->execute() prints the unmodified value both
> > before and after $sth->execute(), which really has me puzzled. And it
> > is not just o-umlaut that gets mapped; other characters such as
> > a-acute, a-grave, e-acute, e-grave, A-acute, A-grave, E-acute, E-grave
> > and y-umlaut also get trampled similarly. I've diagnosed that the
> > problem is in DBI because when run with PERL_DBI_DEBUG=2, the entry
> > for '-> execute for DBD::NullP::st (...)' shows the modified string --
> > the transformation is certainly happening before DBD::NullP gets to
> > see it (and before DBD::Informix sees it either).
> >
> > Jay is using Perl 5.8.0 on AIX 4.3.3 compiled with GCC 2.7.x; I'm using
> > Perl 5.8.0 compiled on Solaris 7 with GCC 3.1 but now running on
> > Solaris 8 using GCC 3.3. Jay is using DBI 1.32; I am using DBI 1.37. I
> > had to force install libxml-perl 0.07 this morning because one test
> > failed. I am up to date within a day or so on almost all the modules I
> > have installed - I did an update with CPANPLUS this morning (DBD::ODBC
> > and DBD::Multiplex are out of date, though CPANPLUS says I've got D::M
> > 0.90 installed and need to install D::M 0.90, which has me confused).
> >
> > Here's the test script - I'm not sure how much more it can be
> > compressed. It needs the file jknullp.xml, which contains all the
> > accented characters I mentioned.
> >
> > Is there a possibility that the XML stuff is somehow setting up the
> > Perl Unicode system so that it thinks the characters should be recoded
> > from ISO 8859-1 (as explicitly stated in the XML file) and is UTF-8
> > encoding them? Let's see: the input character codes are:
> >
> > Name      8859-1  DBI trace    UTF-8
> > o-umlaut  0xF6    0xC3 0xB6    0xC3 0xB6
> > a-grave   0xE0    0xC3 0xA0    0xC3 0xA0
> > a-acute   0xE1    0xC3 0xA1    0xC3 0xA1
> > A-grave   0xC0    0xC3 0x2E *  0xC3 0x80
> > A-acute   0xC1    0xC3 0x2E *  0xC3 0x81
> > E-grave   0xC8    0xC3 0x2E *  0xC3 0x88
> > E-acute   0xC9    0xC3 0x89    0xC3 0x89
> > e-grave   0xE8    0xC3 0xA8    0xC3 0xA8
> > e-acute   0xE9    0xC3 0xA9    0xC3 0xA9
> > y-umlaut  0xFF    0xC3 0xBF    0xC3 0xBF
> >
> > Except for the three starred characters, the DBI trace is showing a
> > valid mapping from ISO 8859-1 to UTF-8. The three starred characters
> > are invalid UTF-8 sequences; the second byte should start with bits 10
> > to be valid.
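
The UTF-8 column and the continuation-byte rule can be checked
mechanically. The following is a sketch only, assuming Perl 5.8 or later
with Encode; the hash names are hypothetical and the code points are the
ones listed in the table.

```perl
use strict;
use warnings;
use Encode qw(encode);

# ISO 8859-1 code points from the table (same numeric value in Unicode).
my %latin1 = (
    "o-umlaut" => 0xF6, "a-grave" => 0xE0, "a-acute" => 0xE1,
    "A-grave"  => 0xC0, "A-acute" => 0xC1, "E-grave" => 0xC8,
    "E-acute"  => 0xC9, "e-grave" => 0xE8, "e-acute" => 0xE9,
    "y-umlaut" => 0xFF,
);

my %utf8_hex;
for my $name (sort { $latin1{$a} <=> $latin1{$b} } keys %latin1) {
    # Encode the single character and split the result into byte values.
    my @bytes = unpack("C*", encode("UTF-8", chr $latin1{$name}));

    # A continuation byte is valid only if its top two bits are 10.
    my $ok = ($bytes[1] & 0xC0) == 0x80 ? "valid" : "invalid";

    $utf8_hex{$name} = join(" ", map { sprintf "0x%02X", $_ } @bytes);
    printf "%-9s 0x%02X -> %s (%s)\n",
        $name, $latin1{$name}, $utf8_hex{$name}, $ok;
}
```

The same mask applied to 0x2E gives 0x00 rather than 0x80, which is why
the starred trace entries cannot be valid second bytes.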
> >
> > Any ideas on how to prevent this transformation from occurring? Is
> > reversion to Perl 5.6.1 the answer? (Ugh if it is). Or will 5.8.1 fix
> > this? Or is it something that should not be fixed? But then how does a
> > person parsing XML deal with this? Or is it a property of the
> > particular XML parser that Jay is using?
> >
> > HELP!!!
> >
> > (See attached file: jknullp.tgz)
> >
> > The tar file contains jknullp.pl (the Perl script), jknullp.trace (the
> > output from running jknullp.pl on Solaris 8), and jknullp.xml (the XML
> > source with accented characters in ISO 8859-1, as noted in the XML
> > encoding information). They all unpack into the current directory.
> >
> > --
> > Jonathan Leffler ([EMAIL PROTECTED])
> > STSM, Informix Database Engineering, IBM Data Management
> > 4100 Bohannon Drive, Menlo Park, CA 94025
> > Tel: +1 650-926-6921 Tie-Line: 630-6921
> > "I don't suffer from insanity; I enjoy every minute of it!"
> > <jknullp.tgz>