On Wednesday 09 November 2016 15:55:47 Gert Brinkmann wrote:
> Hello,
> ...
>
> This prints out the utf8 characters corrupted. You have to flag the
> Variable after writing into it with Encode::_utf8_on() as utf8 to make
> it work correctly. (So activate the commented line.)
>
> Using this _utf8_on() usually means that I am doing something wrong.

Yes, that is true! You should never use the _utf8_on/_utf8_off/is_utf8
functions! They are there *only* for dealing with buggy XS modules, not
for pure perl code... In pure perl code you must *not* care about the
UTF8 flag.

> Is there a better way to achieve the correct behaviour?

Of course! When you think that you need Encode::_utf8_on(), use
utf8::decode() instead (or Encode::decode('UTF-8', ...)). Similarly, use
utf8::encode() (or Encode::encode('UTF-8', ...)) instead of
Encode::_utf8_off().

> Btw. there was a change in the behaviour between perl v5.14.2 and
> v5.20.2: In older perl versions you could do a
>
>   my $html = '';
>   Encode::_utf8_on($html);
>
> before opening the file handle onto this variable. In newer perl
> versions the utf8 flag is reset on open() and print() to the variable's
> file handle.

The UTF8 flag just indicates whether the internal encoding of a perl
scalar is Latin1 or UTF-8. But this is internal: any Latin1 string can
be represented either in Latin1 (without the UTF8 flag) or in UTF-8
(with the UTF8 flag). You should not care about the internal
representation in pure perl code. Any perl function can, at any time,
convert a scalar between these two encodings whenever that is possible
(that is, for strings containing only ASCII or Latin1 characters).

(Btw, on EBCDIC platforms the UTF8 flag indicates whether the internal
encoding is UTF-EBCDIC or EBCDIC, not UTF-8!! So really, do not depend
on the UTF8 flag!)

And to your question, here is an explanation of your source code:

> -----------------------------------------------------
> use strict;
> use utf8;

Now the source code is expected to be in utf8, and string literals in it
are treated as wide characters.

> use Encode;
> use FileHandle;
>
> binmode STDOUT, ":utf8";

Now printing to the STDOUT handle accepts wide characters (> 0xFF) and
converts the output to utf8 octets. So your terminal should be
configured to accept and show UTF-8 sequences correctly.
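To see what such a layer does with a character, here is a minimal sketch
(it is not part of the original program). It prints one character,
U+00FC, through a ":utf8" layer onto an in-memory handle and dumps what
ends up in the buffer. The names $buf and $out are made up for this
illustration.

```perl
use strict;
use warnings;

# In-memory handle with a ":utf8" layer, standing in for STDOUT.
my $buf = '';
open(my $out, '>:utf8', \$buf) or die "cannot open in-memory handle: $!";
print {$out} "\x{FC}";    # one character: U+00FC (u with diaeresis)
close($out) or die "close failed: $!";

# $buf now holds the octets the layer produced, not the character.
my $hex = join(' ', map { sprintf('%02X', ord($_)) } split(//, $buf));
printf "%d octets: %s\n", length($buf), $hex;    # prints "2 octets: C3 BC"
```

One character went in, two octets came out. The same happens on STDOUT
after the binmode call.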
>
> my $html = '';
>
> #-- open filehandle to write into the $html variable as utf8
> open(my $fh, '>:encoding(UTF-8)', \$html);

Now printing to $fh accepts wide characters and converts the printed
characters to utf8 octets before storing them in $html. It means that
$html will *always* contain a sequence of numbers (octets) which
represent utf8 sequences.

> my $orig_stdout = select( $fh );
>
>
> print "Ümläut Test ßaß; 使用下列语言\n";

Now you have a string with wide characters, and this print sends the
string to $html. In $html you now have a sequence of octets which is the
encoded form of that wide string.

>
>
> select( $orig_stdout );
> $fh->close();
>
> #You need to activate this line to make utf8 output correct
> #Encode::_utf8_on($html);
>
> print $html;

And now you send a sequence of utf8 octets to STDOUT, which expects wide
characters and converts them to utf8 octets. So what you get is a
double-encoded utf8 sequence. Now stop and think about why this is true!

> -----------------------------------------------------

The fix is really simple. Either decode the utf8 octets in $html back to
wide characters (via utf8::decode($html)), or tell STDOUT that it should
not expect wide strings but raw octets (= remove the
binmode STDOUT, ":utf8"; line).

Again... think about why both of my proposed fixes work.

Btw, perl does not use the UTF-8 encoding, but perl's extended utf8. If
you want strict UTF-8, use the ":encoding(UTF-8)" layer. The layers
":utf8" and ":encoding(utf8)" (without hyphen) are perl's non-strict
extended utf8 encodings. Also, utf8::encode/decode are non-strict...
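
Both fixes can be tried out without a terminal. The following is only a
sketch under a few assumptions: the helper write_via() and all variable
names are made up for illustration, an in-memory handle stands in for
STDOUT, and the test string is shortened and written with escapes so
that the script itself is plain ASCII.

```perl
use strict;
use warnings;

# Hypothetical helper: print $string to an in-memory handle opened with
# the given layer ('' means no layer) and return what was stored.
sub write_via {
    my ($layer, $string) = @_;
    my $buf = '';
    open(my $fh, ">$layer", \$buf) or die "cannot open in-memory handle: $!";
    print {$fh} $string;
    close($fh) or die "close failed: $!";
    return $buf;
}

# Wide-character string: U+00DC, U+00E4 and U+4F7F are non-ASCII.
my $text = "\x{DC}ml\x{E4}ut \x{4F7F}\n";

# What the original program stores in $html: utf8 octets.
my $html = write_via(':encoding(UTF-8)', $text);

# The bug: octets sent through an encoding layer get encoded again.
my $double = write_via(':encoding(UTF-8)', $html);

# Fix 1: decode the octets back to characters before printing.
my $chars = $html;
utf8::decode($chars) or die "buffer does not hold valid utf8";
my $fix1 = write_via(':encoding(UTF-8)', $chars);

# Fix 2: keep the octets and print them to a handle without a layer.
my $fix2 = write_via('', $html);

printf "characters: %d\n", length($text);      # 9
printf "octets:     %d\n", length($html);      # 13
printf "double:     %d\n", length($double);    # 20
print "fix 1 ok\n" if $fix1 eq $html;
print "fix 2 ok\n" if $fix2 eq $html;
```

Every octet above 0x7F in $html is treated as a character of its own by
the second encoding pass, which is why the double-encoded buffer grows
from 13 to 20 octets. Both fixes end with the same 13 octets on the
output side.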