On Tue, 05 May 2020 10:53:29 +0200, Axel Beckert wrote:

> > Perhaps the strings in wml need to be decoded from UTF-8 so that they
> > aren't treated as a sequence of independent bytes?
> ... and would have expect "use feature unicode_strings;" already
> activates all of this.

(I haven't read the thread in detail …).

Personally I often use "use utf8::all" (from libutf8-all-perl) if I'm
reasonably sure that the input is not weird and I want to output
utf-8. It is sometimes a bit slow but handles all the en/decoding in
my experience.

> > Explicitly using Encode helps:
> >
> > echo 包 | perl -E 'use Encode qw(decode_utf8); while(<>) { $_ =
> > decode_utf8($_); s|\s+\n|\n|sg; print }'
> > Wide character in print at -e line 1, <> line 1.
> > 包

% time echo 包 | perl -E 'use Encode qw(decode_utf8); while(<>) { $_ = decode_utf8($_); s|\s+\n|\n|sg; print }'
Wide character in print at -e line 1, <> line 1.
包
echo 包  0.00s user 0.00s system 42% cpu 0.002 total
perl -E  0.03s user 0.01s system 97% cpu 0.034 total

% time echo 包 | perl -Mutf8::all -E ' while(<>) { s|\s+\n|\n|sg; print }'
包
echo 包  0.00s user 0.00s system 63% cpu 0.002 total
perl -Mutf8::all -E ' while(<>) { s|\s+\n|\n|sg; print }'  0.04s user 0.01s system 98% cpu 0.050 total

% time echo 包 | perl -CS -E 'while(<>) { s|\s+\n|\n|sg; print }'
包
echo 包  0.00s user 0.00s system 60% cpu 0.002 total
perl -CS -E 'while(<>) { s|\s+\n|\n|sg; print }'  0.00s user 0.00s system 83% cpu 0.005 total


Cheers,
gregor

-- 
 .''`.  https://info.comodo.priv.at -- Debian Developer https://www.debian.org
 : :' : OpenPGP fingerprint D1E1 316E 93A7 60A8 104D  85FA BB3A 6801 8649 AA06
 `. `'  Member VIBE!AT & SPI Inc. -- Supporter Free Software Foundation Europe
   `-   BOFH excuse #378:  Operators killed by year 2000 bug bite.