Re: Interpretation of non-UTF8 strings

Marcin 'Qrczak' Kowalczyk Mon, 16 Aug 2004 09:56:55 -0700

On Mon, 16-08-2004, at 16:50 +0100, Nick Ing-Simmons wrote:


> So in my speech synthesis stuff I had:
> 
> use encoding qw(iso-8859-15);
> 
> And then it worked right even if I happened to run it in en_GB.utf8 
> that day.

Places where texts are exchanged are divided into three groups:
1. The encoding must be specified externally or guessed.
2. The protocol includes specifying the encoding of its data.
3. The encoding is fixed by the protocol.

Examples:
1. Local filenames on Linux, @ARGV, local text file contents,
   script source, sockets by default, IRC.
2. Many Internet protocols: WWW, email, usenet. XML files.
3. Gtk2 API (UTF-8), WinNT filenames (UTF-16).

There are also two models of how a Perl script may operate, which are
better not mixed in one program:
A. The old model: it tries to work on the original encoding of the data.
   Uses non-UTF-8 scalars exclusively if the encoding is a byte encoding
   other than UTF-8, uses UTF-8 scalars if it's UTF-8, some things break
   for multibyte encodings other than UTF-8 (e.g. regexps).
B. The new model: it uses Unicode internally, which is physically
   represented by non-UTF-8 scalars if it happens to fit in ISO-8859-1
   and by UTF-8 scalars otherwise.
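
As a rough illustration of model B (a minimal sketch, not a statement
about how any particular program does it): the bytes are decoded once at
the boundary, and afterwards the same Unicode string may sit in either
physical representation without the script noticing. The sample string
and variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# External data arrives as bytes; here "café" encoded in ISO-8859-1.
my $bytes = "caf\xE9";

# Model B: decode at the boundary, then work on characters only.
my $text = decode('ISO-8859-1', $bytes);

# Force the two physical representations of the same string.
my $down = $text;
utf8::downgrade($down);    # byte-per-character (non-UTF-8) scalar
my $up = $text;
utf8::upgrade($up);        # UTF-8 scalar

# Both hold the same four characters and compare equal.
printf "lengths: %d %d\n", length($down), length($up);
print "same string\n" if $down eq $up;
printf "internal UTF-8 flag: %s vs %s\n",
    (utf8::is_utf8($down) ? "on" : "off"),
    (utf8::is_utf8($up)   ? "on" : "off");
```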

Making all combinations work is a lot of work. A1 was simple when it
worked. B3 is also reasonably simple, and this is the model the world
should generally aim at, or at least B2.

The problem with switching from A to B is that B1 is not as simple to
handle as A1. Many sources are still of kind 1, so there is a tendency
to use the A model, especially when programming in the C language. Newer
languages often try to use the B model exclusively (Java, .NET), because
B2 and B3 are better than A2 and A3.

In order for Perl to be usable in model B, it should try harder to make
B1 work, because many places where texts are exchanged are still of
kind 1. How well B1 works determines how widely Unicode is usable.

Ideally places in group 1 should have their encoding settable
individually or in related clusters, with a common default taken from
the locale. The common default might also be settable because taking the
encoding from the locale might not be reliable.
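
One way this could look for @ARGV, as a sketch under assumptions: an
explicit setting wins, then the locale's codeset, then a last-resort
fallback. The helper names pick_encoding and locale_codeset, the
environment variable MYAPP_ARGV_ENCODING, and the ISO-8859-1 fallback
are all hypothetical; only Encode, POSIX and I18N::Langinfo are real
modules.

```perl
use strict;
use warnings;
use Encode qw(decode find_encoding);

# Hypothetical helper: first usable name wins (explicit override, then
# the locale's codeset); otherwise fall back to an arbitrary default.
sub pick_encoding {
    my ($override, $locale_codeset) = @_;
    for my $name ($override, $locale_codeset) {
        next unless defined $name && length $name;
        my $enc = find_encoding($name);    # undef if Encode does not know it
        return $enc->name if $enc;
    }
    return 'iso-8859-1';                   # assumption, not a recommendation
}

# Hypothetical helper: ask the locale for its codeset, or return undef
# on platforms where this is not available.
sub locale_codeset {
    my $codeset = eval {
        require POSIX;
        require I18N::Langinfo;
        POSIX::setlocale(POSIX::LC_CTYPE(), '');
        I18N::Langinfo::langinfo(I18N::Langinfo::CODESET());
    };
    return $codeset;
}

# Decode one group-1 source (the command line) with the chosen encoding.
my $encoding = pick_encoding($ENV{MYAPP_ARGV_ENCODING}, locale_codeset());
my @args = map { decode($encoding, $_) } @ARGV;
```

Other group-1 sources (filenames, file contents, sockets) could call
pick_encoding with their own override and share the same locale default.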

Places in group 2 might sometimes be overridable individually, but the
default should come from the protocol rather than from the locale, and
being able to override is not as important as in group 1. It's probably
enough to be able to see raw data without recoding as an alternative to
applying the encoding from the protocol.
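
For a group-2 source this might be sketched as follows, assuming a
MIME-style Content-Type header carries the charset. Both function names
and the raw option are invented for the example, and returning the raw
bytes when the charset is missing or unknown is just one possible
policy.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Hypothetical helper: pull the charset parameter out of a Content-Type
# header value; returns undef if there is none.
sub charset_from_content_type {
    my ($header) = @_;
    return $header =~ /;\s*charset\s*=\s*"?([^\s";]+)"?/i ? $1 : undef;
}

# Hypothetical helper: decode the body using the protocol's charset,
# or hand back the raw bytes when asked to (raw => 1).
sub body_text {
    my ($header, $raw, %opt) = @_;
    return $raw if $opt{raw};
    my $charset = charset_from_content_type($header);
    my $enc = defined $charset ? find_encoding($charset) : undef;
    return defined $enc ? $enc->decode($raw) : $raw;
}

my $header = 'text/plain; charset=iso-8859-2';
my $text   = body_text($header, "\xB1");              # decoded characters
my $bytes  = body_text($header, "\xB1", raw => 1);    # untouched bytes
```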

Places in group 3 should not be overridable, they should just work.
It's important that they get the correct encoding even if Perl is used
according to model A, if this model is still considered fully supported.

-- 
   __("<         Marcin Kowalczyk
   \__/       [EMAIL PROTECTED]
    ^^     http://qrnik.knm.org.pl/~qrczak/
