On Friday, October 11, 2019 at 04:03:31 p.m. -0600, Jon Jensen wrote:

> > is caused by missing UTF-8 on STDOUT. But, the line with
> >
> >>> HexStr: 50e464616 ...
> >
> > is not and shows that the \xe4 is there. Why?
>
> Perl's internal storage of string data is a little odd. \xe4 is the
> correct Unicode code point as per:
>
> https://en.wikipedia.org/wiki/Latin-1_Supplement_%28Unicode_block%29
>
> It is not UTF-8 encoded, true, but there's no reason Perl internally needs
> to use UTF-8 specifically, and I believe for Latin-1 it does not by
> default. It's a question of in-memory storage and processing (some kind of
> Unicode) vs. input/output (where you want UTF-8).
>
> If your script is configured to send UTF-8 to STDOUT, then I would expect
> that \xe4 will show up as the UTF-8 \xc3\xa4 instead.

The byte \xe4 is not UTF-8. The Unicode code point for the character
'LATIN SMALL LETTER A WITH DIAERESIS' (U+00E4) can be seen here:

http://www.fileformat.info/info/unicode/char/00E4/index.htm

and in UTF-8 it must be encoded as \xc3\xa4. One can also see this in any
UNIX shell:

$ echo ä | od -tx1
0000000 c3 a4 0a

and if you convert it to ISO-8859-1, you get \xe4:

$ echo ä | iconv -f utf-8 -t iso-8859-1 | od -tx1
0000000 e4 0a
0000002

Meanwhile I have learned how to dump strings in Perl with Devel::Peek, and
for the column coming out of PostgreSQL this

...
Dump $string;

gives in this case:

SV = PVIV(0x386c3d0) at 0x2429050
  REFCNT = 1
  FLAGS = (POK,IsCOW,pPOK,UTF8)
  IV = 2
  PV = 0x39f6aa0 "P\303\244dagogische Hochschule Weingarten"\0 [UTF8 "P\x{e4}dagogische Hochschule Weingarten"]
  CUR = 35
  LEN = 37
  COW_REFCNT = 1

i.e. from the PG server the character arrives correctly UTF-8 encoded as
(octal) \303\244, which is the same as \xc3\xa4. And Perl mangles this to

[UTF8 "P\x{e4}dagogische Hochschule Weingarten"]

which is IMHO not correct and is causing all this confusion. We have to
deal with this in our Perl code. It's not a PostgreSQL problem.

Thanks in any case for your attention to this case.

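For dealing with this on the Perl side, the relation between \xe4 and
\xc3\xa4 can be sketched as below. This is only a minimal sketch under
assumptions: $bytes is a hypothetical stand-in for what the server sends,
hex_of is a helper made up for illustration, and it assumes that the column
reaches the script as a decoded character string (which the UTF8 flag in
the dump suggests).

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical stand-in for the column value: the raw UTF-8 bytes on the wire.
my $bytes = "P\xc3\xa4dagogische Hochschule Weingarten";

# Decode the bytes into a Perl character string (assumed to be what the
# script sees for the column, with the UTF8 flag set).
my $string = decode('UTF-8', $bytes);

# Illustration-only helper: hex value of every character (or byte) of a string.
sub hex_of {
    my ($s) = @_;
    return join ' ', map { sprintf '%02x', ord($_) } split //, $s;
}

my $head = substr($string, 0, 2);                        # "Pä"

my $codepoints   = hex_of($head);                        # character view
my $utf8_bytes   = hex_of(encode('UTF-8', $head));       # UTF-8 byte view
my $latin1_bytes = hex_of(encode('ISO-8859-1', $head));  # Latin-1 byte view

# Ask Perl to encode everything written to STDOUT as UTF-8.
binmode(STDOUT, ':encoding(UTF-8)');

print "characters: ", length($string), ", bytes: ", length($bytes), "\n";
print "code points:      $codepoints\n";
print "UTF-8 bytes:      $utf8_bytes\n";
print "ISO-8859-1 bytes: $latin1_bytes\n";
print "$string\n";
```

With the binmode call in place, the ä in the last line should leave the
script as \xc3\xa4. Without it, as far as I understand, Perl would write the
single byte \xe4 for this particular string, which would match the symptom
on STDOUT discussed above.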
    matthias

-- 
Matthias Apitz, ✉ [email protected], http://www.unixarea.de/ +49-176-38902045
Public GnuPG key: http://www.unixarea.de/key.pub
October 3! Congratulations! The Berliner Fernsehturm turns 50
from: https://www.jungewelt.de/2019/10-02/index.php
