Re: Interpretation of non-UTF8 strings

Marcin 'Qrczak' Kowalczyk Tue, 24 Aug 2004 11:08:37 -0700

On Tue, 24 Aug 2004 at 20:20 +0300, Jarkko Hietaniemi wrote:


> > that with double recoding, or with $ARGV[0] not being equivalent to
> > substr($ARGV[0], 0).
> 
> What substr() example you are referring to here?  I cannot find this
> in your recent messages.

$ perl -Mencoding=ISO-8859-2 -Mopen=:encoding\(ISO-8859-2\) -e '
eval {open F, "/etc/shadow"};
print "$ARGV[0]\n", substr($ARGV[0], 0), "\n"' Ą
Ą
"\x{00a1}" does not map to iso-8859-2 at -e line 1.
\x{00a1}

$ perl -Mencoding=ISO-8859-2 -Mopen=:encoding\(ISO-8859-2\) -e '
eval {open F, "/etc/shadow"}; print "$!\n", substr($!, 0), "\n"'
Brak dostępu
"\x{00ea}" does not map to iso-8859-2 at -e line 1.
Brak dost\x{00ea}pu

> > I hope the -C flag is considered a temporary hack, to be eventually
> > replaced with something which supports other encodings and not only
> > UTF-8.
> 
> Possibly.  It was an explicit solution for much greater brokenness
> that resulted from assuming implicit UTF-8 from locales.

What breaks? Maybe the problem is that Perl doesn't distinguish strings
of text from arrays of bytes. Do people expect that print chr(255)
outputs a single byte? It will not work when the stdout encoding is
UTF-8 no matter what.
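
To make the point concrete, here is a minimal sketch. It assumes an
in-memory filehandle with a :encoding(UTF-8) layer as a stand-in for a
UTF-8 stdout; the variable names are only illustrative. U+00FF is encoded
in UTF-8 as the two bytes C3 BF, so one character cannot come out as one
byte through such a handle.

```perl
use strict;
use warnings;

# Stand-in for a UTF-8 stdout: an in-memory handle with an encoding layer.
my $buf = '';
open(my $out, '>:encoding(UTF-8)', \$buf) or die "open: $!";

# One character goes in...
print {$out} chr(255);
close($out) or die "close: $!";

# ...and the buffer shows how many bytes actually came out.
printf "bytes written: %d (hex %s)\n", length($buf), unpack('H*', $buf);
```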

If someone works in a mostly-UTF-8 environment, he probably expects
stdout to be treated as UTF-8 text by default, which implies that he
must use some other means for outputting raw bytes. Maybe syswrite.
Similarly for input and file contents in general.
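
A rough sketch of the "other means" for raw bytes, assuming a scratch
file from File::Temp in place of the real destination: the handle is put
in :raw mode so that no encoding layer is involved, and syswrite sends
the byte unchanged. This only illustrates the idea; it is not a proposal
for how the interface should look.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# A scratch file stands in for whatever should receive raw bytes.
my ($fh, $path) = tempfile(UNLINK => 1);
binmode($fh, ':raw');    # no encoding layer on this handle

# syswrite bypasses PerlIO buffering and writes the byte as-is.
my $written = syswrite($fh, chr(255));
defined $written or die "syswrite: $!";
close($fh) or die "close: $!";

my $size = -s $path;
print "syswrite returned $written, file size is $size\n";
```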

> > use encoding files => "ISO-8859-2";
> > use encoding terminal => "UTF-8";
> 
> What do you mean by "terminal"?  The STD* streams or /dev/tty?

Nothing precise; it is yet to be decided which classes of "places" that
need recoding should be distinguished. Perhaps one switch for all I/O
is enough. Or maybe only the STD* streams should be separated.
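
For comparison, a similar split can be approximated today with the
existing open pragma and binmode. This is only a sketch under
assumptions: the "files" class is taken to mean files opened for output
in the current lexical scope, the "terminal" class is taken to mean
STDOUT, and the file name is made up. It is not the interface proposed
above.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Default layer for files opened for output in this lexical scope.
use open OUT => ':encoding(ISO-8859-2)';

# The "terminal" (here: STDOUT only) gets its own, explicit encoding.
binmode(STDOUT, ':encoding(UTF-8)');

my $text = "\x{0104}";    # LATIN CAPITAL LETTER A WITH OGONEK

# Hypothetical file name inside a temporary directory.
my $path = tempdir(CLEANUP => 1) . "/sample.txt";
open(my $out, '>', $path) or die "open $path: $!";
print {$out} $text;
close($out) or die "close $path: $!";

# The file should hold the ISO-8859-2 form (one byte), while STDOUT
# receives the same character encoded as UTF-8.
my $file_size = -s $path;
print "$text: $file_size byte(s) in the file\n";
```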

-- 
   __("<         Marcin Kowalczyk
   \__/       [EMAIL PROTECTED]
    ^^     http://qrnik.knm.org.pl/~qrczak/
