Dear All,

From what others have written, there are enough useful encodings other
than utf-8, utf-16/UCS-2 and UCS-4 that support efficient storage even
for Unicode files whose contents are Greek, Cyrillic, etc. Sorry for the confusion
caused by the fact that I was not aware of these.

utf-8 is fine for languages like German, Polish, Norwegian, Spanish, French, ...
in which at least 90% of the text consists of 7-bit ASCII characters.
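
To make the storage argument concrete, here is a small Python sketch that counts the bytes the same text needs in different encodings. The sample strings and the helper name encoded_sizes are my own illustration, not anything from this thread.

```python
# Compare how many bytes the same text needs in different encodings.
# The sample strings are illustrative assumptions only.
samples = {
    "German": "Grüße aus München",
    "Russian": "Привет мир",
}

def encoded_sizes(text, encodings=("utf-8", "utf-16-le", "iso-8859-1", "iso-8859-5")):
    """Map each encoding name to the encoded length in bytes, or None if
    the text cannot be represented in that encoding."""
    sizes = {}
    for enc in encodings:
        try:
            sizes[enc] = len(text.encode(enc))
        except UnicodeEncodeError:
            sizes[enc] = None  # some character is outside this charset
    return sizes

for name, text in samples.items():
    print(name, len(text), "characters:", encoded_sizes(text))
```

For mostly-ASCII text utf-8 stays close to one byte per character, while for Cyrillic it needs roughly as much as utf-16, and an 8-bit charset such as ISO-8859-5 needs about half of that.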

Add Perl to that list, by the way. I rather strongly suspect that most Perl code will consist mostly of 7-bit characters, even Perl code written by traditional-Chinese speakers. (I pick on traditional Chinese only because it has a very large character repertoire, which is one of the reasons there's a "simplified" variant.)

My experience is that Perl programs do contain local language, and thus local characters that may be outside ISO-646-IRV (7-bit ASCII), in string literals and in comments.

By the way, there is (should be) nothing that is encodable in a non-Unicode character set that is not encodable in (any encoding of) Unicode. That's where the "uni" bit comes from. If there is, it means that Unicode is not fulfilling its design goals.

Yes, we can consider any file to be Unicode with some encoding. That is how the Java guys do it, with the restriction that they don't easily let you choose anything other than latin-1 plus \ucafe-style escapes for non-latin-1 characters (or maybe I didn't bother to look, because latin-1/ISO-8859-1 works fine for me).
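
As a rough illustration of the "latin-1 plus escapes" idea, the following Python sketch keeps latin-1 characters as single bytes and writes everything else as a backslash escape. It uses Python's built-in backslashreplace error handler; the function name to_latin1_with_escapes is made up for this example, and this is not how any Java tool is actually implemented.

```python
def to_latin1_with_escapes(text: str) -> bytes:
    """Encode text as latin-1, replacing every character outside latin-1
    with a backslash escape such as \\u03a9.

    Caveat: characters outside the Basic Multilingual Plane come out as
    Python-style \\UXXXXXXXX escapes, which Java source would not accept.
    """
    return text.encode("latin-1", errors="backslashreplace")

print(to_latin1_with_escapes("Gr\u00fc\u00dfe \u03a9"))
```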

IMHO the OS should provide a standard way to specify such a charset as a file attribute,
but usually it does not, and it won't in the future, unless the file comes through the
network and has a MIME header.

I think the answer is multi-fold.

0) Auto-detect the encoding in the compiler, if a U+FEFF signature, or a #! signature, is found at the beginning of the input. (If there is a U+FEFF signature, it should get thrown away after it is recognized. It may be possible to recognize on "package" or "module" as well, and possibly even on "#".)

With FFFE and FEFF this seems obvious. In the case of #!, it is not clear to me whether this should default to ISO-8859-1 (latin-1) or to utf-8. See HTML vs. XHTML as an example where the default has been changed.
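
A minimal sketch of step 0) in Python, assuming we only look at the first bytes of the file. The byte-order marks come from Python's codecs module; the function name sniff_signature and the shebang_default parameter are hypothetical, and the parameter is left open precisely because the default for #! is undecided.

```python
import codecs

# Known signatures and the encoding each one implies.
_SIGNATURES = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
]

def sniff_signature(data: bytes, shebang_default=None):
    """Return (encoding or None, data with any byte-order mark removed)."""
    for bom, name in _SIGNATURES:
        if data.startswith(bom):
            # The signature is thrown away once it has been recognized.
            return name, data[len(bom):]
    if data.startswith(b"#!"):
        # A #! line only says "this is a script"; which encoding that
        # implies is a policy choice, so the caller supplies it.
        return shebang_default, data
    return None, data

print(sniff_signature(b"\xff\xfep\x00"))
print(sniff_signature(b"#!/usr/bin/perl\n", shebang_default="iso-8859-1"))
```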

1) Believe what the underlying FS/OS/transport tells us. (This is likely to be a constant for many OSes, possibly selectable at the compiler's compile-time. It's the encoding on the end of the content-type for HTTP and other MIME-based transports.)

I understand that the FS/OS do not really tell us, at least neither on Unix/Linux nor on NT/Windows. Relying on environment variables or locale settings looks dangerous to me: either it breaks programs that worked fine in environment A when you run them elsewhere, or it imposes restrictions on how these environment variables are set up. It could be OK for one-liners run from the command line, like this:

    ls *.JPG|perl -p -e 's/(.*\.)JPG$/mv $1JPG $1jpg/;' |grep mv |sh

It would work fine even for shell scripts, because they would have to set the appropriate environment variables for themselves, thus disregarding any user settings. We would probably want something additional like PERL_DEFAULT_ENCODING, because otherwise we might get clashes with (other) regular use of locale settings.

In cases where the OS or FS really has a capability to provide the encoding on a
per-file basis as a file attribute, or in cases where the file comes from the
network with a MIME header, your suggestion should be perfect.
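
For the MIME case, reading the charset off a Content-Type header could look roughly like this in Python. It relies on the standard email.message.Message class; the wrapper name charset_from_content_type is my own.

```python
from email.message import Message

def charset_from_content_type(header_value: str, default=None):
    """Return the charset parameter of a Content-Type header value in
    lower case, or `default` if the header does not name one."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset(failobj=default)

print(charset_from_content_type("text/plain; charset=ISO-8859-1"))
print(charset_from_content_type("text/plain"))
```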

2) Support a "use encoding 'foo'" similar to that in recent perl5s: It states the encoding that the file it appears in is written in.

Yes, that looks like the right way to do it. And it eliminates part of the concerns about 1), if it is assumed that this "use encoding" line is more or less required in every non-trivial Perl source file. Btw., this is the encoding of the Perl source code itself; files that are processed by Perl I/O could of course have any encoding.
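
To show what finding such a declaration might involve, here is a Python sketch that scans the start of a source file for a "use encoding" line. The regular expression, the 1024-byte limit and the name declared_encoding are all assumptions of mine. It only works if the file is in an ASCII-compatible encoding up to that line, so a utf-16 source would still need the signature check from 0).

```python
import re

# Matches e.g.  use encoding 'latin1';  or  use encoding "utf-8";
_DECL = re.compile(
    rb"""^\s*use\s+encoding\s+(['"])([A-Za-z0-9._:-]+)\1\s*;""",
    re.MULTILINE,
)

def declared_encoding(source: bytes, max_bytes: int = 1024):
    """Return the encoding named by a 'use encoding' line near the top of
    the source, or None if there is no such line."""
    match = _DECL.search(source[:max_bytes])
    return match.group(2).decode("ascii") if match else None

print(declared_encoding(b"#!/usr/bin/perl\nuse encoding 'latin1';\nprint 1;\n"))
```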

(The higher-numbered sources of encoding information override the lower-numbered ones.)

Yes, of course. 0) and 2) are obvious, but 1) might need to be dealt with carefully.
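
The override rule can be sketched in a few lines of Python. The function name resolve_encoding and the utf-8 fallback are assumptions; what to use when none of the three sources says anything is not settled here.

```python
def resolve_encoding(signature=None, transport=None, declaration=None,
                     fallback="utf-8"):
    """Combine the three sources of encoding information.

    signature   -- 0) what a byte-order mark or #! line implied
    transport   -- 1) what the FS/OS/transport reported
    declaration -- 2) what a "use encoding" line stated
    Later (higher-numbered) sources override earlier ones.
    """
    chosen = fallback  # assumed default when nothing is known
    for candidate in (signature, transport, declaration):
        if candidate is not None:
            chosen = candidate
    return chosen

print(resolve_encoding(signature="utf-8", declaration="iso-8859-1"))
```

Whether a declaration should really be allowed to contradict a byte-order mark is exactly the kind of case that would need care.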


Best regards,

Karl


