Thank you, Grant! Thanks for sharing the video! I liked your rule of thumb.

Shirley

On Wed, Dec 18, 2024 at 9:01 PM Grant McLean <[email protected]> wrote:
> Hello Shirley
>
> This is a complex topic which includes "encodings" and the way in which
> Perl is able to deal with binary data (e.g.: a string of bytes) vs
> character data (in which each character might be represented by one or
> more bytes).
>
> There is no "quick fix". You really need to understand when you're dealing
> with bytes vs characters. As a rule of thumb you'll want to "decode" data
> that is coming into your program and "encode" data that is being output to
> the world (e.g.: to a file or in a web page response).
>
> If you're prepared to make the effort to understand, here's a link to a
> video I made on the subject:
>
> https://www.youtube.com/watch?v=cgswnneFp-s
>
> There are a number of reasons why the behaviour of your code might have
> changed following the upgrade.
> - The newer versions of libraries and utilities might have different
>   defaults for handling bytes vs character data.
> - The "locale" setting in the upgraded system might be different (e.g.:
>   LANG="C" vs LANG="en_US.UTF-8").
> - The environment in which the code executes might be different.
>
> Well-written code that is explicit about handling character data and where
> the encoding/decoding should happen would be resistant to those types of
> outside influences.
>
> In the video I walk through a scenario where some code appeared to be
> working correctly, but then one small change broke things in different
> ways. The fixes are to add in explicit handling of encoding.
>
> However this is not really an issue that is specific to DBI or DBD::Pg -
> apart from being explicit about your use of the "pg_enable_utf8" attribute
> on your database handle:
>
> https://metacpan.org/pod/DBD::Pg#pg_enable_utf8-(integer)
>
> I hope that sets you on the right path.
>
> Regards
> Grant McLean
>
> On Wed, 2024-12-18 at 16:33 -0500, Shaomei Liu wrote:
> > send again after subscribing.
> >
> > On Wed, Dec 18, 2024 at 11:20 AM Shaomei Liu <[email protected]> wrote:
> > >
> > > Hello,
> > > I have a project which uses DBI to write to a Postgres DB.
> > > After upgrading from RHEL7 to RHEL8, UTF-8 characters are not displayed
> > > properly in the DB. The DB has the correct UTF-8 encoding set.
> > > For example, the left double quotation mark “ is displayed as
> > > â\u0080\u009C.
> > > With support from the DBI community, the issue was solved by calling
> > > decode from the Encode module before writing to the DB.
> > > I am wondering what change in DBD::Pg caused this issue.
> > >
> > > Perl version is 5.26.3 and 5.16.3 on EL8 and EL7 respectively.
> > > DBI version is 1.641 and 1.627 on EL8 and EL7 respectively.
> > >
> > > Here is the program and its execution results.
> > > Any feedback is greatly appreciated!
> > > Thank you
> > > Shirley
> > >
> > > xxx.com> cat testutf_decode.pl
> > > #!/usr/bin/perl
> > > use strict;
> > > use warnings;
> > > use DBI;
> > > use Encode 'decode';
> > > print "DBI version: $DBI::VERSION\n";
> > >
> > > my $db = "debugutf";
> > > my $host = "db";
> > > my $user = "postgres";
> > > my $pass = "";
> > > my $dbh = DBI->connect("DBI:Pg:dbname=$db;host=$host",$user,$pass);
> > > my $sql = 'INSERT INTO table1 (title) VALUES (?)';
> > > my $query = $dbh->prepare($sql);
> > > my $bytes = '“';
> > > my $chars = decode('UTF-8', $bytes);
> > > print "$bytes contains ".length($bytes)." characters\n";
> > > print "after decode $bytes contains ".length($chars)." characters\n";
> > > #my @values = ($bytes); #=======>without decode, Database shows “ on EL7 but â\u0080\u009C on EL8
> > > my @values = ($chars); #======>with decode, Database shows “ on both EL8 and EL7, decode fixed the issue
> > > $query->execute(@values);
> > >
> > > ############### running on EL8
> > > xxx.com> ./testutf_decode.pl
> > > DBI version: 1.641
> > > “ contains 3 characters
> > > after decode “ contains 1 characters
> > >
> > > [yyy.com]$ psql -Upostgres -hdb debugutf
> > > psql (16.6)
> > > debugutf=# select * from table1;
> > >      title
> > > ---------------
> > >  â\u0080\u009C ==========>NOK without decode
> > >  “ =============>OK with decode, so decode fixed the issue
> > > (2 rows)
> > >
> > > ############### running on EL7
> > > xxx.com> ./testutf_decode.pl
> > > DBI version: 1.627
> > > “ contains 3 characters
> > > after decode “ contains 1 characters
> > >
> > > [yyy.com]$ psql -Upostgres -hdb debugutf
> > > psql (16.6)
> > > debugutf=# select * from table1;
> > >      title
> > > ---------------
> > >  “ =============>OK without decode
> > >  “ =============>OK with decode
> > > (2 rows)
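
The bytes vs characters distinction discussed in this thread can be reproduced without a database. The following is only a minimal sketch using the core Encode module: it does not touch DBI or DBD::Pg and is not a confirmed diagnosis of what changed between EL7 and EL8. It shows that UTF-8 encoding a string that was never decoded yields exactly the "â\u0080\u009C" symptom reported above, and that decoding first avoids it. The helper name codepoints is hypothetical and exists only for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper for this sketch: list a string's code points.
sub codepoints {
    my ($str) = @_;
    return join ' ', map { sprintf('U+%04X', ord($_)) } split(//, $str);
}

# The raw UTF-8 bytes of U+201C (left double quotation mark), written as
# escapes so the result does not depend on the source file's encoding.
my $bytes = "\xE2\x80\x9C";

# Decode on the way in: three bytes become one character.
my $chars = decode('UTF-8', $bytes);

# Encoding the undecoded bytes treats each byte as a Latin-1 character,
# so the text ends up UTF-8 encoded twice.
my $double = encode('UTF-8', $bytes);

# What a UTF-8 reader sees when it decodes the doubly encoded data.
my $seen = decode('UTF-8', $double);

# Encode on the way out: declare the output encoding once on the handle.
binmode(STDOUT, ':encoding(UTF-8)');

printf "bytes:  length %d, %s\n", length($bytes),  codepoints($bytes);
printf "chars:  length %d, %s\n", length($chars),  codepoints($chars);
printf "double: length %d\n",     length($double);
printf "seen:   length %d, %s\n", length($seen),   codepoints($seen);
```

Here $bytes has length 3 and $chars has length 1, matching the script output quoted above, while $double has length 6. Decoding $double gives the three characters U+00E2 U+0080 U+009C, which is the "â\u0080\u009C" row seen in psql. Whether the driver performs this second encoding depends on its version and on settings such as pg_enable_utf8, so passing decoded character strings to execute() is the approach least sensitive to those defaults.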
