Marcin 'Qrczak' Kowalczyk <[EMAIL PROTECTED]> writes:
>On Mon, 16-08-2004 at 11:16 +0100, Nick Ing-Simmons wrote:
>
>> >Perl treats them inconsistently. On one hand they are read from files
>> >and used as filenames without any recoding, which implies that they are
>> >assumed to be in some unspecified default encoding.
>>
>> Actually perl makes no such assumption - this is just historical
>> "it just works" code which is compatible with perls before 5.6.
>
>There is a reasonable assumption that all textual data are in some
>unspecified but consistent encoding unless specified otherwise.
>
>The historical model is that data is manipulated in its stored encoding
>directly. It kind of works, as long as all data uses the same encoding,
>and as long as locale is consulted when the actual meaning of non-ASCII
>bytes is important (which can be quite hard if the encoding happens to
>be UTF-8 or another multibyte encoding).
>
>Since exchanging non-ASCII data between computers becomes more
>important, and the assumption that all data uses the same encoding too
>often becomes false, and a multibyte encoding - UTF-8 - becomes more
>common, another text processing model appears. The model is to use
>Unicode internally, and convert data on I/O. This model is usually
>better suited for handling non-ASCII data, especially if different
>sources use different encodings, but it's a switch from existing
>practice, so it's not universally adopted yet.
>
>It's convenient to assume that this conversion uses some default
>encoding unless specified otherwise, so not all programs must deal with
>encodings explicitly. Programs which don't specify encodings at all work
>too, as long as all data they encounter is encoded using that encoding.
>The locale mechanism is used on Unix to specify the default encoding
>and other things.
>
>In my case the encoding is ISO-8859-2. It will become UTF-8 in future
>when more programs are compatible with UTF-8.
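
For anyone following along, the "Unicode internally, convert on I/O" model described above can be sketched with the Encode module that ships with 5.8. This is only an illustration under my own assumptions: the sample word and the variable names are made up for the example and are not from the code being discussed.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Input: a Polish word ("turtle") as ISO-8859-2 bytes, 4 bytes long.
# (Hypothetical sample data, standing in for something read from a file.)
my $bytes_in = "\xBF\xF3\xB3w";

# Decode on input: bytes become a Perl character string (Unicode inside).
my $text = decode('iso-8859-2', $bytes_in);

# Work on characters rather than bytes.
my $upper = uc $text;

# Encode on output: the same text can be written in any target encoding.
my $latin2_out = encode('iso-8859-2', $upper);
my $utf8_out   = encode('UTF-8',      $upper);

printf "%d chars, %d bytes as ISO-8859-2, %d bytes as UTF-8\n",
    length($text), length($latin2_out), length($utf8_out);
```

The same conversion can be attached to a filehandle with an I/O layer, for example `open(my $fh, '<:encoding(iso-8859-2)', $file)` with `$file` being whatever path you are reading, so the body of the program never sees raw bytes.
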
>
>> >On the other hand
>> >they are upgraded to UTF-8 as if they were ISO-8859-1.
>>
>> This is possibly dubious practice, but was what happened in 5.6
>> which had Unicode but no Encode module. That situation lasted
>> long enough that there is a code base that relies on it.
>
>This is broken.

But (sadly) we have to be compatible with some 5.6 codebase.

>
>perl -e 'use Glib; use Gtk2 -init;
>$window = Gtk2::Object->new(Gtk2::Window, title => "ąćęłńóśźż");
>$window->show_all(); Gtk2->main()'

so perl knows what you are doing.

>
>It shows incorrect title: characters are treated as if they were
>ISO-8859-1. It's unreasonable to assume that everybody lives in USA or
>Western Europe and uses ISO-8859-1. I have locale set correctly to pl_PL
>with ISO-8859-2. How to tell Perl to respect that?

Add 'use encoding qw(iso-8859-2);'

>
>> >IMHO it would be more logical to assume that strings without the UTF-8
>> >flag are in some default encoding, probably taken from the locale.
>> >Upgrading them to UTF-8 should take it into account instead of blindly
>> >assuming ISO-8859-1,
>>
>> It would be more logical but would break things.
>
>They are already broken by assuming that everyone uses ISO-8859-1.

perl5.8 allows you to specify it. If you don't specify it, it assumes
perl5.6 compatibility mode ;-)
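
To make the difference concrete, here is a small sketch of what the compatibility default does to a single byte versus what an explicitly named encoding does. It uses Encode's decode() rather than the encoding pragma, purely so the effect is visible on one string; the byte chosen and the variable names are my own illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# One byte, 0xB1: "a with ogonek" in ISO-8859-2,
# but the plus-minus sign in ISO-8859-1.
my $byte = "\xB1";

# The 5.6-compatible default: upgrading a byte string to UTF-8 maps
# each byte to the code point with the same number, i.e. Latin-1.
my $upgraded = $byte;
utf8::upgrade($upgraded);

# With the encoding stated explicitly, the byte is interpreted
# as ISO-8859-2 instead.
my $decoded = decode('iso-8859-2', $byte);

printf "upgraded: U+%04X\n", ord $upgraded;   # U+00B1, plus-minus sign
printf "decoded:  U+%04X\n", ord $decoded;    # U+0105, a with ogonek
```

The first result is what ends up in the window title when nothing is specified; the second is what you get once perl has been told the data is ISO-8859-2.
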