On Thu, Nov 04, 2004 at 01:42:13PM -0800, Susan Cassidy wrote:
> I finally got my large, complex cgi/Oracle application working with
> DBD::Oracle 1.16, using database character set AL32UTF8, NLS_LANG=.UTF8,
> etc.

And what are the _client_ CHAR and NCHAR character sets?
And is the field you're inserting into a CHAR or NCHAR?

> The test program takes some English sentences, runs them through a
> translator (which produces utf8 output, works fine - data validates as utf8
> on multiple systems, etc.).

It's important to keep in mind that "validates as utf8" is ambiguous.

It could mean *either or both* of:

    a) the sequence of bytes is a valid utf8 encoding.
    b) the perl scalar value has the perl SvUTF8 flag turned on.

Much confusion is caused by not keeping those two separate points
in mind. It's important to be clear what you're thinking about,
and precise when communicating it to others.
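
As a minimal sketch of the difference (the sample bytes are made up,
and no database is involved), the two questions can be asked
separately in plain perl:

```perl
use strict;
use warnings;

# Hypothetical sample: the two bytes that encode U+00E9 in UTF-8.
my $bytes = "\xC3\xA9";

# (a) Is the byte sequence a valid utf8 encoding?
# utf8::decode() works in place and returns false on invalid input,
# so try it on a copy.
my $copy = $bytes;
my $is_valid_encoding = utf8::decode($copy) ? 1 : 0;

# (b) Does the perl scalar have the SvUTF8 flag turned on?
my $flag_on = utf8::is_utf8($bytes) ? 1 : 0;

print "valid encoding: $is_valid_encoding, SvUTF8 flag: $flag_on\n";
print "decoded copy, SvUTF8 flag: ", (utf8::is_utf8($copy) ? 1 : 0), "\n";
```

Here the original scalar passes test (a) but not test (b); only the
decoded copy has the flag on.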

> I then insert it into the database, and retrieve it.  The retrieved
> data did not match the translated data.

I'm afraid that "The retrieved data did not match the translated
data" is another ambiguous statement.

If a sequence of bytes that does not have the SvUTF8 flag turned
on is compared with the same sequence of bytes that does, they won't
match (unless the string is all ASCII).

Perl will encode the sequence of bytes that does not have the SvUTF8
flag turned on into UTF8 by treating each byte as a Latin1 character
(by default). If the sequence of bytes was UTF8 encoded already
(but not marked with the SvUTF8 flag) then treating each byte as a
Latin1 character will produce garbage unless the string is all ASCII.

So the two strings with the same sequence of bytes may not match!
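
A small illustration of that mismatch, again with made-up sample
bytes. The utf8::upgrade() call only makes visible what perl does
implicitly for the comparison:

```perl
use strict;
use warnings;

my $bytes = "\xC3\xA9";    # UTF-8 bytes for U+00E9, SvUTF8 flag off
my $chars = $bytes;
utf8::decode($chars);      # one character U+00E9, SvUTF8 flag on

# Both scalars hold the same two bytes internally, yet they are not
# equal: $bytes is read as the two Latin1 characters U+00C3 U+00A9.
my $equal = ($bytes eq $chars) ? 1 : 0;

# Upgrading a copy of the byte string shows what perl compares against.
my $upgraded = $bytes;
utf8::upgrade($upgraded);

printf "equal: %d, length(bytes as chars): %d, length(chars): %d\n",
    $equal, length($upgraded), length($chars);
```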

> I added some tests in the code to check on the translated value like:  
>     if (Encode::is_utf8($textval)) {
>       print "<p>&nbsp;is utf8!\n";
>     } else {
>       print "<p>&nbsp;is NOT utf8\n";
>     }
> This prints "is NOT utf8"  (when I know that it really is utf8).

Do you know which out of A and B above Encode::is_utf8 actually tests for?
Do you know which out of A and B you mean by "it really is utf8"?

> If I do the same thing to the retrieved data, it prints that the data IS
> utf8.

The returned data will be both valid utf8 and have the SvUTF8 flag on
if your relevant (CHAR/NCHAR) client character set is UTF8 or AL32UTF8.

But that doesn't mean it contains the same string you passed in! :)
So I trust you're also checking if $inserted_value eq $fetched_value.

> However, if I turn off the utf8 flag explicitly after retrieving the data,
> before comparing the translated data with the retrieved data, it works:

Probably because you're now comparing byte strings as byte strings.
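
To show what I mean without a database, here is a sketch where
$fetched is only a stand-in for a value that came back with the
SvUTF8 flag on. The variable names and sample bytes are hypothetical:

```perl
use strict;
use warnings;
use Encode ();

my $inserted = "\xC3\xA9";   # UTF-8 bytes from the translator, flag off
my $fetched  = $inserted;
utf8::decode($fetched);      # stand-in for the fetched value, flag on

my $before = ($inserted eq $fetched) ? 1 : 0;

# Clearing the flag leaves the same bytes, now compared as bytes.
my $flag_cleared = $fetched;
Encode::_utf8_off($flag_cleared);
my $after = ($inserted eq $flag_cleared) ? 1 : 0;

# The alternative: decode the input so both sides are character strings.
my $decoded_input = $inserted;
utf8::decode($decoded_input);
my $alt = ($decoded_input eq $fetched) ? 1 : 0;

print "before: $before, flag cleared: $after, input decoded: $alt\n";
```

Encode::_utf8_off is documented as an internal function, so decoding
the input before the insert is probably the cleaner way to get two
comparable strings.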

> Of course, where I print out the status of utf8 below this, it now says it
> is NOT utf8.

Of course.

> I have re-read the Encode perldoc stuff several times.  It seems to be
> working (on my system) backwards, sort of?
> 
> In the DBD::Oracle 1.16 docs, Tim says:
>       If the string passed to bind_param() is considered by perl to be a
>        valid utf8 string ( utf8::is_utf8($string) returns true ), then
>        DBD::Oracle will implicitly set csform SQLCS_NCHAR and csid AL32UTF8
>        for you on insert.
> So, I think this may have something to do with it.  However, I am
> "unset"ting it after retrieval, not before inserting it. ????

But was it actually set on the value you inserted?

[FYI, the output from trace() quotes strings with the SvUTF8 flag
on with double quotes, and uses single quotes if SvUTF8 is off.
That's a quick way to see what's going on.]

> By the way, the same program moved over to a different machine where we use
> PostgreSQL (DBD::Pg) (without the _utf8_off, of course)  works fine (as I
> would expect).

I suspect DBD::Pg is doing something wrong that just happens to
work for your view of how it ought to work. Of course, I may be wrong.

Tim.
