Re: Interpretation of non-UTF8 strings

Nick Ing-Simmons Mon, 16 Aug 2004 07:24:40 -0700

Marcin 'Qrczak' Kowalczyk <[EMAIL PROTECTED]> writes:
>On Mon, 16-08-2004, at 11:16 +0100, Nick Ing-Simmons wrote:
>
>> >Perl treats them inconsistently. On one hand they are read from files
>> >and used as filenames without any recoding, which implies that they are
>> >assumed to be in some unspecified default encoding. 
>> 
>> Actually perl makes no such assumption - this is just historical
>> "it just works" code which is compatible with perls before 5.6.
>
>There is a reasonable assumption that all textual data are in some
>unspecified but consistent encoding unless specified otherwise.
>
>The historical model is that data is manipulated in its stored encoding
>directly. It kind of works, as long as all data uses the same encoding,
>and as long as locale is consulted when the actual meaning of non-ASCII
>bytes is important (which can be quite hard if the encoding happens to
>be UTF-8 or another multibyte encoding).
>
>As exchanging non-ASCII data between computers becomes more
>important, as the assumption that all data uses the same encoding
>more often proves false, and as a multibyte encoding - UTF-8 - becomes
>more common, another text processing model is emerging. The model is to
>use Unicode internally and convert data on I/O. This model is usually
>better suited to handling non-ASCII data, especially if different
>sources use different encodings, but it's a switch from existing
>practice, so it's not universally adopted yet.
>
>It's convenient to assume that this conversion uses some default
>encoding unless specified otherwise, so not all programs must deal with
>encodings explicitly. Programs which don't specify encodings at all work
>too, as long as all data they encounter is encoded using that encoding.
>The locale mechanism is used on Unix to specify the default encoding
>and other things.
>
>In my case the encoding is ISO-8859-2. It will become UTF-8 in the
>future, when more programs are compatible with UTF-8.
>
>> >On the other hand
>> >they are upgraded to UTF-8 as if they were ISO-8859-1.
>> 
>> This is possibly dubious practice, but was what happened in 5.6 
>> which had Unicode but no Encode module. That situation lasted 
>> long enough that there is a code base that relies on it.
>
>This is broken.


But (sadly) we have to be compatible with some 5.6 codebase.

>
>perl -e 'use Glib; use Gtk2 -init;
>$window = Gtk2::Object->new(Gtk2::Window, title => "ąćęłńóśźż");
>$window->show_all(); Gtk2->main()'

so perl knows what you are doing.


>
>It shows an incorrect title: characters are treated as if they were
>ISO-8859-1. It's unreasonable to assume that everybody lives in the USA
>or Western Europe and uses ISO-8859-1. I have my locale set correctly to
>pl_PL with ISO-8859-2. How do I tell Perl to respect that?

Add 'use encoding qw(iso-8859-2);' 
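
As a rough sketch of the same idea done by hand: rather than relying on
the pragma, the bytes can be decoded explicitly with the Encode module
that ships with perl 5.8. The variable names here are only illustrative,
and the byte values are an assumption standing in for the ISO-8859-2
text of a title. Later perls deprecated the encoding pragma, so explicit
decoding may be the safer route on newer installations.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumed input: the bytes for "ą", "ć", "ę" as stored in ISO-8859-2.
# The same bytes would mean "±", "æ", "ê" if read as ISO-8859-1.
my $bytes = "\xB1\xE6\xEA";

# Name the encoding explicitly instead of leaving perl to assume one.
my $title = decode('iso-8859-2', $bytes);

# $title is now a character string holding the Polish letters.
my @code_points = map { sprintf 'U+%04X', ord } split //, $title;
print "@code_points\n";    # U+0105 U+0107 U+0119
```

A string decoded this way can then be handed to code that expects
Unicode characters without the ISO-8859-1 guess coming into play.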


>
>> >IMHO it would be more logical to assume that strings without the UTF-8
>> >flag are in some default encoding, probably taken from the locale.
>> >Upgrading them to UTF-8 should take it into account instead of blindly
>> >assuming ISO-8859-1, 
>> 
>> It would be more logical but would break things.
>
>They are already broken by assuming that everyone uses ISO-8859-1.

perl5.8 allows you to specify it.
If you don't specify it, it assumes perl5.6 compatibility mode ;-)
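
To make the two behaviours concrete, here is a small sketch (the
variable names are made up for illustration) that joins a byte string to
a character string. Without a stated encoding the byte is upgraded as if
it were ISO-8859-1; decoding it first gives the ISO-8859-2 reading.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "\xB1";       # byte string, intended as ISO-8859-2 "ą"
my $chars = "\x{142}";    # character string: "ł" (U+0142)

# Joining the two forces perl to upgrade $bytes. With no encoding given,
# byte 0xB1 is taken as ISO-8859-1 and becomes U+00B1 ("±").
my $implicit = $bytes . $chars;
my $implicit_first = sprintf 'U+%04X', ord substr($implicit, 0, 1);

# Decoding first states the encoding, so 0xB1 becomes U+0105 ("ą").
my $explicit = decode('iso-8859-2', $bytes) . $chars;
my $explicit_first = sprintf 'U+%04X', ord substr($explicit, 0, 1);

print "implicit: $implicit_first, explicit: $explicit_first\n";
# implicit: U+00B1, explicit: U+0105
```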
