> Portability is not a sufficient excuse though. There are bugs, like

That's right, we haven't fixed things because we are lazy and stupid.
How did you guess?

> that with double recoding, or with $ARGV[0] not being equivalent to
> substr($ARGV[0], 0).

What substr() example are you referring to here?  I cannot find it
in your recent messages.

> The API is, I'm afraid, not good enough, even if we ignore the old mode
> of manipulating data in its external encoding. Namely, it doesn't
> distinguish specifying the encoding of the script source (which depends
> on where it has been written) from specifying the encoding that the
> script should assume on STDIN/STDOUT/STDERR and other places (which
> depends on where it is being run).  Well, other places once implemented,
> assuming it will indeed be triggered by the 'encoding' pragma.

You may consider the encoding pragma broken for your uses, and that is
fine, but I have to point out that many people are happily using it.

If your environment is such that your script is in encoding X and
your utilities operate in encoding X, all is fine.  It's when you mix
encodings that things get murkier.
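
For the record, the happy path looks like this (a sketch assuming a 5.8
perl; 0xB1 is LATIN SMALL LETTER A WITH OGONEK in Latin-2):

    use encoding "ISO-8859-2";
    my $s = "\xB1";                # a Latin-2 byte in the source text
    printf "U+%04X\n", ord $s;     # prints U+0105: decoded into Unicode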

Take for example the output of qx(): you may declare somehow that it is
in UTF-8, but the moment some utility behaves differently and spits out
Latin-1 or Latin-2 or SJIS, you are screwed.
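
The best you can do is to decode explicitly and at least croak on
malformed input; telling well-formed Latin-1 from Latin-2 apart is
hopeless.  A sketch (some_utility is of course made up):

    use Encode ();
    my $raw  = qx(some_utility);   # raw bytes, real encoding unknown
    my $text = Encode::decode("UTF-8", $raw, Encode::FB_CROAK);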

> I hope the -C flag is considered a temporary hack, to be eventually
> replaced with something which supports other encodings and not only
> UTF-8.

Possibly.  It was an explicit solution to the much greater brokenness
that resulted from assuming implicit UTF-8 from locales.
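
(For reference, with the 5.8.1-style letter flags:

    perl -CS script.pl     # assume UTF-8 on STDIN/STDOUT/STDERR
    perl -CSDA script.pl   # plus UTF-8 default PerlIO layers and UTF-8 @ARGV

and nothing there can spell "Latin-2", which I take to be your point.)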

> use encoding files => "ISO-8859-2";
> use encoding terminal => "UTF-8";

What do you mean by "terminal"?  The STD* streams or /dev/tty?

> use encoding filenames => "ISO-8859-1";
> use encoding env => "locale";

Something like that would be nice, yes.  Someone needs to implement it,
though, and that's the problem.
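
Pieces of it you can already approximate with the open pragma plus
binmode (filenames and %ENV have no hook at all), something like:

    use open ":encoding(ISO-8859-2)";     # newly opened files are Latin-2
    binmode STDIN,  ":encoding(UTF-8)";   # the terminal talks UTF-8
    binmode STDOUT, ":encoding(UTF-8)";
    binmode STDERR, ":encoding(UTF-8)";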

> We should think how it interacts with locale-aware behavior of
> functions. Without 'use locale' and other pragmas it's clear: Perl
> consistently assumes that every text is ISO-8859-1. When something like

Well, no.  In that case Perl assumes that everything is in whatever
8-bit encoding the platform happens to be using, with the exception that
/\w/ and so forth only implement the character set of ASCII (in effect,
the raw underlying <ctype.h> API).
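
In code:

    my $s = "\xE9";                               # e-acute in Latin-1, but just a byte to Perl
    print $s =~ /\w/ ? "word\n" : "non-word\n";   # prints "non-word": /\w/ stops at ASCII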

> 'use encoding' is in effect, Perl still interprets the scalars in the
> same way, but treats them differently when they interact with the world.
> 
> But with 'use locale' it assumes that non-UTF-8 scalars are in the
> current locale encoding, which is incompatible with the assumptions
> taken when UTF-8 scalars and non-UTF-8 scalars are mixed. So the two will
> probably never work together. If 'use locale' includes some essential
> features besides the treatment of texts, like date/time formatting,
> it should be available by other means, without at the same time causing
> ord(lc(chr(161))) to be equal to 177, which doesn't make sense if
> character codes are interpreted according to Unicode. It implies
> that when localized texts are taken from the system, they must be
> decoded from the locale encoding.

If you really do have a Grand Plan of how to integrate locales and
Unicode happily, congratulations.
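
The mechanics of the very last step do exist today, for what it's worth
(a sketch; $bytes_from_system is a made-up stand-in for, say, a
strerror() result):

    use POSIX qw(setlocale LC_ALL);
    use I18N::Langinfo qw(langinfo CODESET);
    use Encode ();

    setlocale(LC_ALL, "");              # adopt the user's locale
    my $codeset = langinfo(CODESET);    # e.g. "ISO-8859-2"
    my $text = Encode::decode($codeset, $bytes_from_system);

Everything before that step is where the Grand Plan comes in.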

-- 
Jarkko Hietaniemi <[EMAIL PROTECTED]> http://www.iki.fi/jhi/
"There is this special biologist word we use for 'stable'.
 It is 'dead'." -- Jack Cohen
