Re: Latin-1-characters

Karl Brodowsky Tue, 16 Mar 2004 13:06:09 -0800

Dear All,

From what others have written, there are enough useful encodings other
than utf-8, utf-16/UCS-2 and UCS-4 that support efficient storage even
for Unicode files whose contents are Greek, Cyrillic, etc.  Sorry for the confusion
caused by the fact that I was not aware of these.
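
To make the storage argument concrete, here is a small sketch in Python (chosen only for illustration) that counts the bytes needed for the same short word under a few encodings. The sample words and the choice of ISO-8859-1 and ISO-8859-5 are my own assumptions; they are not necessarily the encodings the other posters had in mind.

```python
# Byte counts for the same short text under different encodings.
# The sample words and the encodings compared are illustrative assumptions.
samples = {
    "German": ("Größe", ["utf-8", "utf-16-le", "iso-8859-1"]),
    "Russian": ("Привет", ["utf-8", "utf-16-le", "iso-8859-5"]),
}


def encoded_sizes(text, encodings):
    """Map each encoding name to the number of bytes needed for text."""
    return {enc: len(text.encode(enc)) for enc in encodings}


for language, (text, encodings) in samples.items():
    print(language, encoded_sizes(text, encodings))
```

For mostly-ASCII text the utf-8 overhead is small, while for an all-Cyrillic word utf-8 needs twice as many bytes as an 8-bit Cyrillic charset.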

utf-8 is fine for languages like German, Polish, Norwegian, Spanish, French, ... in which >= 90% of the text consists of 7-bit ASCII characters.

Add perl to that list, by the way. I rather strongly suspect that most perl code will consist mostly of 7-bit characters, even perl code written by traditional-Chinese speakers. (I pick on traditional Chinese only because it has a very large character repertoire, which is one of the reasons there's a "simplified" variant.)


My experience is that Perl programs do contain local language, and thus
local characters outside of ISO-646-IRV (7-bit ASCII), in
string literals and in comments.

By the way, there is (should be) nothing that is encodable in a non-Unicode character set that is not encodable in (any encoding of) Unicode. That's where the "uni" bit comes from. If there is, it means that Unicode is not fulfilling its design goals.


Yes, we can consider any file to be unicode with some encoding.  That is
how the Java-guys do it, with the restriction that they don't easily let
you choose anything other than latin-1 + \ucafe-stuff for non-latin-1
characters (or maybe I didn't bother, because latin-1/ISO-8859-1 works
fine for me).

IMHO the OS should provide a standard way to specify such a charset as a file attribute, but usually it does not and it won't in the future, unless the file comes through the network and has a MIME header.

I think the answer is multi-fold.

0) Auto-detect the encoding in the compiler, if a U+FEFF signature, or a #! signature, is found at the beginning of the input. (If there is a U+FEFF signature, it should get thrown away after it is recognized. It may be possible to recognize on "package" or "module" as well, and possibly even on "#".)


With FFFE and FEFF this seems obvious.  In case of #! it would not be clear
to me if this defaults to ISO-8859-1 (latin-1) or to utf-8.  See HTML
vs. XHTML as an example where the default has been changed.
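
A minimal sketch of the signature check in 0), written in Python purely for illustration. The function name `sniff_source_encoding` is hypothetical, and covering the utf-8 and UTF-32 signatures in addition to FFFE/FEFF is my own assumption. When no signature is found (for example, a file starting with #!), the sketch returns None and leaves the choice of default open.

```python
import codecs

# Signatures ordered longest first, so that UTF-32 LE (FF FE 00 00) is
# tried before UTF-16 LE (FF FE). A UTF-16 LE file whose first character
# is U+0000 would be misread as UTF-32 LE; that case is ignored here.
_SIGNATURES = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_source_encoding(raw):
    """Return (encoding or None, raw bytes with any signature removed)."""
    for bom, name in _SIGNATURES:
        if raw.startswith(bom):
            # Throw the signature away once it has been recognized.
            return name, raw[len(bom):]
    # No signature: the caller must fall back to some other source.
    return None, raw
```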

1) Believe what the underlying FS/OS/transport tells us. (This is likely to be a constant for many OSes, possibly selectable at the compiler's compile-time. It's the encoding on the end of the content-type for HTTP and other MIME-based transports.)


I understand that the FS/OS do not really tell us, at least not on
Unix/Linux or NT/Windows.  Relying on environment variables or locale
settings looks dangerous to me, because it breaks programs that worked fine
in environment A when you run them elsewhere, or it imposes restrictions
on how to set up these environment variables.  It could be OK for one-liners
run from the command line, like this:
ls *.JPG|perl -p -e 's/(.*\.)JPG$/mv $1JPG $1jpg/;' |grep mv |sh
This would work fine even for shell scripts, because they would have
to set the appropriate environment variables for themselves, thus disregarding
any user settings.  We would probably need something additional like PERL_DEFAULT_ENCODING,
because otherwise we might get clashes with (other) regular use of locale settings.

In cases where the OS or FS really has a capability to provide encoding on a
per-file basis as a file attribute, or in cases where the file comes from the
network with a MIME header, your suggestion should be perfect.
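
For the MIME case, the charset can be read from the Content-Type header. A small sketch in Python, using the standard library's `email.message.Message`; the helper name `charset_from_content_type` is hypothetical.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the charset in lower case,
    # or None if the header carries no charset parameter.
    return msg.get_content_charset()


print(charset_from_content_type("text/plain; charset=ISO-8859-1"))
```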

2) Support a "use encoding 'foo'" similar to that in recent perl5s: It states the encoding that the file it appears in is written in.


Yes, that looks like the right way to do it.  It also eliminates part of the
concerns about 1), if we assume that this "use encoding" line is more or less required
in every non-trivial Perl source file.  By the way, this is the encoding of the Perl source code
itself; files that are processed by Perl I/O could of course have any encoding.
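
A rough sketch of how a tool might look for such a declaration before decoding the file, in Python for illustration only. The function name and the regular expression are my own assumptions, not how perl actually parses the pragma. Because it searches the raw bytes, it can only work when the source encoding is ASCII-compatible (latin-1, utf-8 and so on), not for utf-16.

```python
import re

# Matches a line such as:  use encoding 'latin1';
# The accepted characters for the encoding name are an assumption.
_PRAGMA = re.compile(
    rb"^[ \t]*use[ \t]+encoding[ \t]+(['\"])([A-Za-z0-9._:-]+)\1[ \t]*;",
    re.MULTILINE,
)


def find_encoding_pragma(raw):
    """Return the declared encoding name as str, or None if absent."""
    match = _PRAGMA.search(raw)
    if match is None:
        return None
    return match.group(2).decode("ascii")
```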

(The higher-numbered sources of encoding information override the lower-numbered ones.)

Yes, of course. 0) and 2) are obvious, but 1) might need to be dealt with carefully.
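
The override rule can be sketched as follows, again in Python for illustration. The function and parameter names are hypothetical, and the utf-8 fallback is an arbitrary assumption, since the thread has not settled whether the default should be latin-1 or utf-8.

```python
def resolve_source_encoding(signature=None, transport=None, pragma=None,
                            default="utf-8"):
    """Pick the source encoding; higher-numbered sources win.

    signature: result of 0), the U+FEFF / #! auto-detection
    transport: result of 1), what the FS/OS/transport reports
    pragma:    result of 2), the "use encoding" declaration
    default:   fallback when nothing is known (an assumption)
    """
    chosen = default
    # Walk from lowest to highest priority; each known value overrides
    # whatever was chosen before it.
    for candidate in (signature, transport, pragma):
        if candidate is not None:
            chosen = candidate
    return chosen


print(resolve_source_encoding(signature="utf-16-le", transport="iso-8859-1"))
```

Note that under this rule a value from 1) overrides a detected signature, which is one reason 1) needs careful handling.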

Best regards,

Karl
