On Friday, October 11, 2019 at 04:03:31 PM -0600, Jon Jensen wrote:

> > is caused by missing UTF-8 on STDOUT. But, the line with
> >
> >>> HexStr: 50e464616 ...
> >
> > is not and shows that the \xe4 is there. Why?
> 
> Perl's internal storage of string data is a little odd. \xe4 is the 
> correct Unicode code point as per:
> 
> https://en.wikipedia.org/wiki/Latin-1_Supplement_%28Unicode_block%29
> 
> It is not UTF-8 encoded, true, but there's no reason Perl internally needs 
> to use UTF-8 specifically, and I believe for Latin-1 it does not by 
> default. It's a question of in-memory storage and processing (some kind of 
> Unicode) vs. input/output (where you want UTF-8).
> 
> If your script is configured to send UTF-8 to STDOUT, then I would expect 
> that \xe4 will show up as the UTF-8 \xc3\xa4 instead.

The byte \xe4 on its own is not valid UTF-8. The Unicode code point for the letter

Unicode Character 'LATIN SMALL LETTER A WITH DIAERESIS' (U+00E4)

can be seen here:

http://www.fileformat.info/info/unicode/char/00E4/index.htm

and its UTF-8 encoding must be \xc3\xa4. One can also see this in any UNIX shell:

$ echo ä | od -tx1
0000000    c3  a4  0a

and if you convert it to ISO-8859-1 then you will get \xe4:

$ echo ä | iconv -f utf-8 -t iso-8859-1 | od -tx1
0000000    e4  0a
0000002
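
The same comparison can be sketched in Perl with the core Encode module.
This is only an illustration: the variable names are made up, and the
character is written as \x{e4} so that the script does not depend on the
encoding of the source file.

```perl
use strict;
use warnings;
use Encode qw(encode);

# One character, U+00E4, given by its code point.
my $char = "\x{e4}";

# Encode that character into bytes for two different encodings.
my $utf8_bytes   = encode('UTF-8', $char);
my $latin1_bytes = encode('ISO-8859-1', $char);

# Render each byte as two hex digits, similar to od -tx1.
my $utf8_hex   = join ' ', map { sprintf '%02x', ord } split //, $utf8_bytes;
my $latin1_hex = join ' ', map { sprintf '%02x', ord } split //, $latin1_bytes;

print "UTF-8:      $utf8_hex\n";    # c3 a4
print "ISO-8859-1: $latin1_hex\n";  # e4
```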

In the meantime I learned how to dump strings in Perl with Devel::Peek.
For the column coming out of PostgreSQL, the call

...
Dump $string;

gives for this case:

SV = PVIV(0x386c3d0) at 0x2429050
  REFCNT = 1
  FLAGS = (POK,IsCOW,pPOK,UTF8)
  IV = 2
  PV = 0x39f6aa0 "P\303\244dagogische Hochschule Weingarten"\0 [UTF8 "P\x{e4}dagogische Hochschule Weingarten"]
  CUR = 35
  LEN = 37
  COW_REFCNT = 1

i.e. the PostgreSQL server delivers the character correctly as
(octal) \303\244, which is the same as \xc3\xa4. And Perl mangles this to 

[UTF8 "P\x{e4}dagogische Hochschule Weingarten"]

which is IMHO not correct and causing all this confusion.
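
To check what Perl really holds for such a string, a short sketch can
compare the character view with the byte view. It assumes the column value
reaches Perl as a decoded, UTF8-flagged string, as the Dump output suggests;
the literal and the variable names are stand-ins, not our real code.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Stand-in for the bytes sent by the server: "P" plus U+00E4 as UTF-8 octets.
my $octets = "P\xc3\xa4";

# Decoding yields a character string with the internal UTF8 flag set.
my $string = decode('UTF-8', $octets);

my $flagged   = utf8::is_utf8($string) ? 1 : 0;            # internal UTF8 flag
my $n_chars   = length $string;                            # counted in characters
my $codepoint = sprintf '%x', ord substr($string, 1, 1);   # second character

# Encoding again for output gives back the original octets.
my $out_hex = unpack 'H*', encode('UTF-8', $string);

print "flag=$flagged chars=$n_chars codepoint=$codepoint bytes=$out_hex\n";
```

This should print flag=1 chars=2 codepoint=e4 bytes=50c3a4, so at least in
this small case the octets \xc3\xa4 come back once the string is encoded
as UTF-8 for output.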

We have to deal with this in our Perl code. It's not a PostgreSQL
problem.

Thanks in any case for your attention to this case.

        matthias

-- 
Matthias Apitz, ✉ [email protected], http://www.unixarea.de/ +49-176-38902045
Public GnuPG key: http://www.unixarea.de/key.pub

October 3! Congratulations! The Berlin TV Tower turns 50 
from: https://www.jungewelt.de/2019/10-02/index.php
