On Tue, 2021-10-19 at 11:34 +1100, Daniel Kasak via gtk-perl-list wrote:
> Right. I found a hack on https://perldoc.perl.org/perlunicode ( which
> you directed me to ) that appears to have fixed *this* particular
> issue ( though it's not clear what I've then broken as a result )
> Calling:
>
>   Encode::_utf8_on($_)
>
> ... for every value just prior to being pushed into the model
> appears to work. Yay :)

The operative word here is "appears". This hack will work for most
characters but not all.

The general advice for working with encodings from Perl is that you
should:

 * decode bytes on input to give you strings in Perl's internal
   representation, which supports multi-byte characters; and
 * encode strings to bytes in a particular encoding on output

These days the most common encoding you will encounter is UTF-8. To do
the relevant decoding of a UTF-8 file you might open it like this:

  open(my $fh, '<:utf8', $filename) or die "Can't read $filename: $!";

Or, if the string was not read from a file but was simply defined in
your script, you would tell Perl to decode the bytes of your script
from UTF-8 by including this pragma:

  use utf8;

For output to a file you might use:

  open(my $fh, '>:utf8', $filename) or die "Can't write $filename: $!";

Your experience seems to suggest that the Perl Gtk bindings will do the
right thing when presented with a string that has the internal "utf8"
flag set. But if your string has non-ASCII characters and does not
already have that flag set, then it seems the decoding step has been
missed.

Data that came from a DB connection rather than a file might need to be
decoded with something like:

  $perl_string = Encode::decode_utf8($db_string);

However most of the DBD drivers allow you to set a flag so that this
happens automatically.

The reason messing with the utf8 flag on the Perl string appears to
work is that Perl's internal encoding is almost-but-not-quite UTF-8.
For historical reasons (and arguably as a memory optimisation)
sometimes Perl will encode some characters in the range 0x80-0xFF as a
single byte ("Latin-1" encoding) rather than the two bytes that UTF-8
would require.

For example chr(0x20AC) would return a Perl string which was
represented in memory using UTF-8 bytes. Whereas chr(0xE9) would return
a Perl string which was represented in memory using a single Latin-1
byte.
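
To make those cases concrete, here is a rough sketch you can run. It
assumes the DB hands back raw UTF-8 bytes; the variable names and the
sample value are made up for illustration and are not from your code.
It compares decoding with flipping the flag, first on UTF-8 bytes and
then on a string Perl has stored as a single Latin-1 byte:

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical value as it might arrive from a DB handle: the two UTF-8
# bytes for e-acute (U+00E9), with no utf8 flag set.
my $db_bytes = "\xC3\xA9";

# The recommended route: decode the bytes into a one-character string.
my $decoded = Encode::decode_utf8($db_bytes);

# The hack: flip the flag on a copy of the same bytes. Because the bytes
# happen to be valid UTF-8 already, the result is the same character.
my $lucky = $db_bytes;
Encode::_utf8_on($lucky);

# The two example strings from the text above.
my $euro   = chr(0x20AC);   # stored internally as UTF-8, flag on
my $eacute = chr(0xE9);     # stored as a single Latin-1 byte, flag off

# The hack again, this time on the single-byte string. Only the flag
# changes, so Perl is left holding a lone 0xE9 byte marked as UTF-8.
my $broken = $eacute;
Encode::_utf8_on($broken);

# Report flags and validity only; the characters themselves are not
# printed, so no output layer is needed on STDOUT.
my $yn = sub { $_[0] ? 'yes' : 'no' };
printf "decoded bytes give e-acute:     %s\n", $yn->($decoded eq chr(0xE9));
printf "flag-flipped bytes give e-acute: %s\n", $yn->($lucky eq chr(0xE9));
printf "chr(0x20AC) has utf8 flag:      %s\n", $yn->(utf8::is_utf8($euro));
printf "chr(0xE9) has utf8 flag:        %s\n", $yn->(utf8::is_utf8($eacute));
printf "flag-flipped chr(0xE9) valid:   %s\n", $yn->(utf8::valid($broken));
```

The utf8::is_utf8 and utf8::valid calls are there purely to peek at
Perl's internals for this demonstration; ordinary code should not need
them once the decoding happens on input.
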
Simply setting the utf8 flag on the first string, chr(0x20AC), would do
no harm (since it's already set) but it would make a mess of the second
string, chr(0xE9), because it's only one byte long and not a valid
UTF-8 sequence.

If you really want to understand this stuff here's a link to a
conference talk I did on the subject:

  https://www.youtube.com/watch?v=cgswnneFp-s

Regards
Grant