Re: need help with utf-8

Felipe Gasper via dbi-users Wed, 18 Dec 2024 05:54:21 -0800

Do we know, in fact, why this changed?

The new behaviour may be “more correct”, but it’ll still subtly break a bunch 
of stuff that worked fine before.


Encoding bugs in Perl are notoriously hard to track down. DBD::Pg is popular; 
it would be good to know exactly why this happened so that others could 
proactively adjust their code accordingly.
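
For anyone who wants to see the symptom without a database, here is a minimal sketch. It assumes the damage comes from a layer treating the three raw UTF-8 bytes as three separate characters and encoding them a second time. That matches the reported output, but it is an illustration only, not DBD::Pg's actual code. hex_octets is a hypothetical helper added for the demo.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# U+201C LEFT DOUBLE QUOTATION MARK as raw UTF-8 bytes. This is what a
# literal quote mark holds in a UTF-8 source file without "use utf8".
my $bytes = "\xE2\x80\x9C";

# The decoded form: a single character.
my $chars = decode('UTF-8', $bytes);

# Hypothetical helper: hex dump of the octets in a byte string.
sub hex_octets {
    my ($octets) = @_;
    return join ' ', map { sprintf '%02X', $_ } unpack 'C*', $octets;
}

# Encoding the decoded character yields the original three bytes.
my $once = encode('UTF-8', $chars);

# Encoding the undecoded bytes treats each byte as a character
# (U+00E2, U+0080, U+009C) and yields six bytes: double encoding.
my $twice = encode('UTF-8', $bytes);

printf "bytes: %d, chars: %d\n", length($bytes), length($chars);
print "encoded once:  ", hex_octets($once),  "\n";
print "encoded twice: ", hex_octets($twice), "\n";
```

The double-encoded bytes C3 A2 C2 80 C2 9C are the UTF-8 forms of â, U+0080 and U+009C, which is exactly the â\u0080\u009C that psql showed on EL8.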

Also, I recommend my Unicode/UTF-8 talk on this topic, particularly the “use 
utf8” section starting at about 9m30s and again around 20m20s: 
https://www.youtube.com/watch?v=yH5IyYyvWHU

-FG


> On Dec 18, 2024, at 12:12 AM, Dan Book <gri...@gmail.com> wrote:
> 
> Indeed, how strings work has not changed, but DBD::Pg's interpretation of 
> your strings probably did; the new behavior is more "correct" and now that 
> you are sending it decoded Unicode characters you may avoid other mysterious 
> issues. (Note that DBI itself does not handle strings; it just provides the 
> interface. DBD::Pg defines how strings are sent to and from the database.)
> 
> -Dan
> 
> On Tue, Dec 17, 2024 at 11:09 PM Shaomei Liu <sliu.newjer...@gmail.com> wrote:
> Dear Dan, Mark, Felipe, Alexander,
> Thank you all for your valuable feedback!
> As I replied to Dan yesterday, this is my first time asking a mailing list 
> for support. I was very surprised and happy to get answers so quickly!
> I added "use utf8;" as suggested by Dan, and it worked for the test program 
> shown in my email, but not for my project. 
> Then I tried decode, as suggested by Dan, and it worked for both the test 
> program and the project, so the issue is solved for me!!! 
> Perl version is 5.26.3 and 5.16.3 on EL8 and EL7 respectively.
> DBI version is 1.641 and 1.627 on EL8 and EL7 respectively.
> 
> Here is the test program with decode. I also printed the string lengths, 
> because I thought this was a Perl thing, but the lengths are the same on EL8 
> and EL7. So I am not sure whether a Perl or a DBI change caused the issue.
> xxx.com> cat testutf_decode.pl
> #!/usr/bin/perl
> use strict;
> use warnings;
> use DBI;
> use Encode 'decode';
> print "DBI version: $DBI::VERSION\n";
> 
> my $db = "debugutf";
> my $host = "db";
> my $user = "postgres";
> my $pass = "";
> my $dbh = DBI->connect("DBI:Pg:dbname=$db;host=$host",$user,$pass);
> my $sql = 'INSERT INTO table1 (title) VALUES (?)';
> my $query = $dbh->prepare($sql);
> my $bytes = '“';
> my $chars = decode('UTF-8', $bytes);
> print "$bytes contains ".length($bytes)." characters\n";
> print "after decode $bytes contains ".length($chars)." characters\n";
> # With the next line (raw bytes), the database shows “ on EL7 but â\u0080\u009C on EL8:
> #my @values = ($bytes);
> # With decoded characters, the database shows “ on both EL8 and EL7, so decode fixed the issue:
> my @values = ($chars);
> $query->execute(@values); 
> 
> xxx.com> ./testutf_decode.pl  #running on EL8
> DBI version: 1.641
> “ contains 3 characters
> after decode “ contains 1 characters
> 
> xxx.com> ./testutf_decode.pl #running on EL7
> DBI version: 1.627
> “ contains 3 characters
> after decode “ contains 1 characters
> 
> Thank you!!
> Shirley
> 
> On Tue, Dec 17, 2024 at 3:30 PM Alexander Foken via dbi-users 
> <dbi-users@perl.org> wrote:
> Hi,
> DBD::ODBC has several tests related to Unicode handling 
> (40UnicodeRoundTrip.t, 41Unicode.t, 45_unicode_varchar.t); they should also 
> work with other DBDs. They should tell you whether your problem is between 
> Perl and Postgres or simply in the encoding of your terminal.
> Alexander
> On 17.12.2024 13:31, Felipe Gasper via dbi-users wrote:
>> Respectfully to Dan & others, I don’t advocate adding “use utf8” to existing 
>> code without a clear understanding of where your program’s decode & encode 
>> points are.
>> 
>> Check to see what DBD::Pg actually writes to the database. If it suddenly 
>> started encoding, that’s a breaking change that either was documented or 
>> should be reported upstream.
>> 
>>> On Dec 16, 2024, at 17:13, Shaomei Liu <sliu.newjer...@gmail.com> wrote:
>>> 
>>>  Hello,
>>> I am very happy to find this mailing list, as it is my last resort!! 
>>> I have a project which uses DBI to write to a Postgres DB.
>>> After upgrading from RHEL7 to RHEL8, UTF-8 characters are no longer displayed 
>>> properly in the DB. The DB has the correct UTF-8 encoding set.
>>> For example, the left double quotation mark “ is displayed as â\u0080\u009C.
>>> You can use this link to check the hex UTF-8 bytes:
>>> https://www.cogsci.ed.ac.uk/~richard/utf-8.cgi?input=%E2%80%9C&mode=char
>>> 
>>> Below is the file testutf.pl, which writes the left double quotation mark “ 
>>> to the database, followed by the query results from psql on both EL8 and 
>>> EL7.
>>> 
>>> ==========file testutf.pl==========
>>> #!/usr/bin/perl
>>> use strict;
>>> use warnings;
>>> use DBI;
>>> print "DBI version: $DBI::VERSION\n";
>>> 
>>> my $db = "debugutf";
>>> my $host = "db";
>>> my $user = "postgres";
>>> my $pass = "";
>>> my $dbh = DBI->connect("DBI:Pg:dbname=$db;host=$host",$user,$pass);
>>> my $sql = 'INSERT INTO table1 (title) VALUES (?)';
>>> my $query = $dbh->prepare($sql);
>>> my @values = ('“');
>>> $query->execute(@values);
>>> ===================================
>>> 
>>> ==============on RHEL8
>>> #execute testutf.pl which wrote “ to database on RHEL8
>>> text.tac1.dev.bia-boeing.com> ./testutf.pl
>>> DBI version: 1.641
>>> 
>>> #from psql
>>> debugutf=# select * from table1;
>>>      title
>>> ---------------
>>>  â\u0080\u009C  =========>unexpected
>>> (1 row)
>>> 
>>> 
>>> ==============on RHEL7
>>> #execute testutf.pl which wrote “ to database on RHEL7
>>> text.tac1.dev.bia-boeing.com> ./testutf.pl
>>> DBI version: 1.627
>>> 
>>> #from psql
>>> debugutf=# select * from table1;
>>>      title
>>> ---------------
>>>  “       ============>expected
>>> (1 row)
>>> 
>>> Any feedback is appreciated.
>>> thank you
>>> Shirley
> -- 
> Alexander Foken
> mailto:alexan...@foken.de
