On Fri, 11 Oct 2019, Matthias Apitz wrote:

The resulting column contains ISO 8859-1 data:

HexStr: 50e46461676f67697363686520486f6368736368756c65205765696e67617274656e2020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020 P<E4>dagogische Hochschule Weingarten

That output is from a print statement going to STDOUT, so I wonder, are you already telling Perl that you want UTF-8 output with something like this at the beginning of your script?

use open qw( :std :utf8 );

Otherwise Perl may be defaulting to writing out Latin-1.
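
To make the effect of an output layer visible without depending on the terminal, here is a minimal sketch that writes the same string through two different PerlIO layers into an in-memory buffer and dumps the resulting bytes as hex. The helper name bytes_written and the sample string are made up for illustration; explicit layers stand in for what "use open qw( :std :utf8 )" would set up on STDOUT.

```perl
use strict;
use warnings;

# Hypothetical helper: write $text through the given PerlIO layer into an
# in-memory buffer and return the bytes that were written, as lowercase hex.
sub bytes_written {
    my ($layer, $text) = @_;
    my $buffer = '';
    open(my $fh, ">$layer", \$buffer) or die "open failed: $!";
    print {$fh} $text;
    close($fh) or die "close failed: $!";
    return unpack 'H*', $buffer;
}

# Sample string with one Latin-1 range character (code point 0xE4).
my $name = "P\xe4dagogische";

# No encoding layer: the character goes out as the single byte e4.
print "raw:   ", bytes_written(':raw', $name), "\n";

# UTF-8 encoding layer: the same character goes out as the two bytes c3 a4.
print "utf-8: ", bytes_written(':encoding(UTF-8)', $name), "\n";
```

The string itself is identical in both calls; only the layer on the filehandle decides which bytes leave the program.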

Hi Jon,

Of course the line

P<E4>dagogische Hochschule Weingarten

is caused by missing UTF-8 on STDOUT. But, the line with

HexStr: 50e464616 ...

is not and shows that the \xe4 is there. Why?

Perl's internal storage of string data is a little odd. \xe4 is the correct Unicode code point for ä (U+00E4), as per:

https://en.wikipedia.org/wiki/Latin-1_Supplement_%28Unicode_block%29

It is not UTF-8 encoded, true, but there's no reason Perl internally needs to use UTF-8 specifically, and I believe for Latin-1 it does not by default. It's a question of in-memory storage and processing (some kind of Unicode) vs. input/output (where you want UTF-8).
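
A small sketch of that distinction, using only the built-in utf8::is_utf8 and utf8::upgrade functions. The variable names are illustrative, and this only inspects Perl's internal representation; it says nothing about how the data came out of the database driver.

```perl
use strict;
use warnings;

# A one-character string holding code point 0xE4. Written as a plain
# literal like this, Perl stores it internally as a single byte.
my $native = "\xe4";

# Copy it and force the copy into Perl's internal UTF-8 representation.
my $upgraded = $native;
utf8::upgrade($upgraded);

# The internal flag differs between the two copies...
printf "native:   is_utf8=%d length=%d\n",
    (utf8::is_utf8($native) ? 1 : 0), length($native);
printf "upgraded: is_utf8=%d length=%d\n",
    (utf8::is_utf8($upgraded) ? 1 : 0), length($upgraded);

# ...but as strings they are the same single character.
print "equal as strings: ", ($native eq $upgraded ? "yes" : "no"), "\n";
```

Both copies have length 1 and compare equal, which is why the internal representation usually does not matter until the data crosses an input or output boundary.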

If your script is configured to send UTF-8 to STDOUT, then I would expect that \xe4 will show up as the UTF-8 \xc3\xa4 instead.
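
As a quick check of that expectation, the following sketch compares the code point held in the string with the bytes produced by explicitly encoding it with the core Encode module. How your HexStr line is actually computed is not shown in this thread, so treat the first hex dump as an assumption about what it reports.

```perl
use strict;
use warnings;
use Encode qw(encode);

# One character, code point 0xE4, as a Perl character string.
my $char = "\xe4";

# Hex of the code points held in the string (assumed to correspond to
# what a hex dump of the in-memory value shows).
my $internal_hex = join '', map { sprintf '%02x', ord } split //, $char;

# Hex of the bytes produced when the same string is encoded as UTF-8,
# which is what a UTF-8 output layer would write.
my $utf8_hex = unpack 'H*', encode('UTF-8', $char);

print "code points: $internal_hex\n";
print "UTF-8 bytes: $utf8_hex\n";
```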

Jon


--
Jon Jensen
End Point Corporation
https://www.endpoint.com/
