On Tue, Mar 16, 2004 at 10:17:57PM +0100, Karl Brodowsky wrote:
: With FFFE and FEFF this seems obvious. In case of #! it would not be clear
: to me if this defaults to ISO-8859-1 (latin-1) or to utf-8. See HTML
: vs. XHTML as an example where the default has been changed.

Perl 6 would certainly try to default to utf-8 rather than latin-1.

: >1) Believe what the underlying FS/OS/transport tells us. (This is likely
: >to be a constant for many OSes, possibly selectable at the compiler's
: >compile-time. It's the encoding on the end of the content-type for HTTP
: >and other MIME-based transports.)
:
: I understand that the FS/OS do not really tell us, at least neither for
: Unix/Linux nor for NT/Windows. Relying on environment variables or locale
: settings looks dangerous to me, because it breaks programs that worked fine
: in environment A when you run them elsewhere, or it imposes restrictions
: on how to set up these environment variables. It could be ok for one-liners
: run from the command line like this
:     ls *.JPG|perl -p -e 's/(.*\.)JPG$/mv $1JPG $1jpg/;' |grep mv |sh
: stuff. This would work fine even for shell scripts, because they would have
: to set the appropriate environment variables for themselves, thus
: disregarding any user settings. Probably something additional like
: PERL_DEFAULT_ENCODING, because otherwise we might get clashes with (other)
: regular use of locale-settings.
:
: In cases where the OS or FS really has a capability to provide encoding on a
: per file basis as a file attribute or in cases where the file comes from the
: network with a mime-header, your suggestion should be perfect.

If the metadata can be trusted, then we'll trust the metadata. Otherwise,
Perl 6 will attempt heuristics only if it recognizes a file with high bits
set that can't possibly be any common Unicode encoding (where that could
include SCSU, I suppose, if it starts with 0e fe ff).

: >2) Support a "use encoding 'foo'" similar to that in recent perl5s: It
: >states the encoding that the file it appears in is written in.
:
: Yes, that looks like the right way to do it. And it eliminates part of the
: concerns for 1), if it is assumed that this "use encoding" line is kind of
: required in every non-trivial perl-source. Btw.
: this is the encoding of the perl-source-code itself; files that are
: processed by perl I/O could of course have any encoding.

Yes, and you change those from the defaults with other pragmas or options.

: >(the higher-numbered sources of encoding information override the former
: >ones.)
:
: Yes, of course. 0) and 2) are obvious, but 1) might need to be dealt with
: carefully.

Yes, 1) needs to be split into 1) "metadata" and -1) "heuristics". 1) is
only reliable metadata, which specifically does NOT include environment
variables. If the metadata is only guessing, it's better to assume 0) and
only if that fails regress to any -1) heuristics.

Larry
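
The precedence described in this message can be sketched roughly as follows. This is only an illustration in Python, not anything Perl 6 actually does: the function name `resolve_encoding`, the `trusted_metadata` parameter, and the reason labels are all hypothetical. It assumes that 0) means "BOM if present, otherwise the utf-8 default", and it uses latin-1 as a stand-in for the -1) heuristics, which the thread leaves unspecified.

```python
import codecs
import re

# Byte order marks, longest first so UTF-32 LE is not mistaken for UTF-16 LE.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Matches a line such as: use encoding 'latin-1';
# Searching raw bytes only works for ASCII-compatible source encodings.
PRAGMA = re.compile(rb"^\s*use\s+encoding\s+['\"]([\w.-]+)['\"]", re.MULTILINE)


def resolve_encoding(raw, trusted_metadata=None):
    """Return (encoding, reason) for the source bytes in `raw`.

    `trusted_metadata` stands for reliable out-of-band information only
    (a MIME charset, a per-file attribute), never environment variables.
    """
    # 2) An explicit declaration in the source overrides everything else.
    match = PRAGMA.search(raw)
    if match:
        return match.group(1).decode("ascii").lower(), "pragma"

    # 1) Reliable metadata overrides the default.
    if trusted_metadata:
        return trusted_metadata.lower(), "metadata"

    # 0) A BOM if there is one, otherwise assume utf-8.
    for bom, name in BOMS:
        if raw.startswith(bom):
            return name, "bom"
    try:
        raw.decode("utf-8")
        return "utf-8", "default"
    except UnicodeDecodeError:
        # -1) Heuristics only once the default has failed.
        # latin-1 is a placeholder assumption for whatever guessing is used.
        return "latin-1", "heuristic"
```

With this sketch, a plain ASCII script resolves to the utf-8 default, while a file with stray high-bit bytes and no declaration only reaches the heuristic step after utf-8 decoding has failed.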