Re: Latin-1-characters

Larry Wall Tue, 16 Mar 2004 14:17:12 -0800

On Tue, Mar 16, 2004 at 10:17:57PM +0100, Karl Brodowsky wrote:
: With FFFE and FEFF this seems obvious.  In case of #! it would not be clear
: to me if this defaults to ISO-8859-1 (latin-1) or to utf-8.  See HTML
: vs. XHTML as an example where the default has been changed.


Perl 6 would certainly try to default to utf-8 rather than latin-1.

: >1) Believe what the underlying FS/OS/transport tells us.  (This is likely
: >to be a constant for many OSes, possibly selectable at the compiler's 
: >compile-time.  It's the encoding on the end of the content-type for HTTP 
: >and other MIME-based transports.)
: 
: I understand that the FS/OS do not really tell us, at least not for
: Unix/Linux or NT/Windows.  Relying on environment variables or locale
: settings looks dangerous to me, because it breaks programs that worked fine
: in environment A when you run them elsewhere, or it imposes restrictions
: on how to set up these environment variables.  It could be ok for one-liners
: run from the command line like this
: ls *.JPG|perl -p -e 's/(.*\.)JPG$/mv $1JPG $1jpg/;' |grep mv |sh
: stuff.  This would work fine even for shell scripts, because they would have
: to set the appropriate environment variables for themselves, thus
: disregarding any user settings.  Probably something additional like
: PERL_DEFAULT_ENCODING, because otherwise we might get clashes with (other)
: regular use of locale-settings.
: 
: In cases where the OS or FS really has a capability to provide encoding on a
: per-file basis as a file attribute, or in cases where the file comes from the
: network with a MIME header, your suggestion should be perfect.

If the metadata can be trusted, then we'll trust the metadata.
Otherwise, Perl 6 will attempt heuristics only if it recognizes a file
with high bits set that can't possibly be any common Unicode encoding
(where that could include SCSU, I suppose, if it starts with 0e fe ff).
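
A rough sketch of that order of checks, written in Python purely for
illustration (nothing here is Perl 6 code or a committed design). The
function name, the "needs-heuristics" label and the exact signature table
are assumptions; "scsu" is only a label, since Python ships no SCSU codec,
and UTF-16 without a signature is not handled at all.

```python
import codecs

# Leading byte signatures. The UTF-32 entries come before UTF-16 because
# the UTF-32 LE signature (FF FE 00 00) begins with the UTF-16 LE one.
SIGNATURES = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (b"\x0e\xfe\xff", "scsu"),          # SCSU signature: 0E FE FF
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def guess_source_encoding(data: bytes) -> str:
    """Hypothetical helper: name the encoding, or ask for heuristics."""
    # An explicit signature at the start of the file wins.
    for signature, name in SIGNATURES:
        if data.startswith(signature):
            return name
    # Otherwise assume the utf-8 default; plain ASCII passes this check too.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # High bits set, but not valid utf-8: only now consider guessing.
        return "needs-heuristics"
```

With this sketch a "#!" script in plain ASCII is reported as utf-8, and a
file holding a lone latin-1 byte such as E9 is the only kind that reaches
the heuristics branch.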

: >2) Support a "use encoding 'foo'" similar to that in recent perl5s: It 
: >states the encoding that the file it appears in is written in.
: 
: Yes, that looks like the right way to do it.  And it eliminates part of the
: concerns for 1), if it is assumed that this "use encoding" line is kind of
: required in every non-trivial perl-source.  Btw. this is the encoding of the
: perl-source-code itself; files that are processed by perl I/O could of course
: have any encoding.

Yes, and you change those from the defaults with other pragmas or options.

: >(the higher-numbered sources of encoding information override the former 
: >ones.)
: 
: Yes, of course.  0) and 2) are obvious, but 1) might need to be dealt with
: carefully.

Yes, 1) needs to be split into 1) "metadata" and -1) "heuristics".
1) is only reliable metadata, which specifically does NOT include
environment variables.  If the metadata is only guessing, it's better
to assume 0) and only if that fails regress to any -1) heuristics.
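
To make the ordering concrete, here is a small sketch in Python (for
illustration only, not a proposed implementation). It assumes that 0) means
the utf-8 default, which this message does not spell out; the function
name, its parameters and the latin-1 fallback guess are all hypothetical.

```python
from typing import Callable, Optional


def choose_encoding(
    data: bytes,
    pragma: Optional[str] = None,        # 2) an explicit "use encoding"
    metadata: Optional[str] = None,      # 1) what the FS/transport claims
    metadata_reliable: bool = False,     # env vars and locale never count
    heuristic: Callable[[bytes], str] = lambda d: "latin-1",  # -1) a guess
) -> str:
    # 2) A declaration inside the source overrides everything else.
    if pragma is not None:
        return pragma
    # 1) Metadata is used only when it can be trusted.
    if metadata is not None and metadata_reliable:
        return metadata
    # 0) Assumed default: utf-8, accepted if the bytes decode cleanly.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # -1) Fall back to heuristics only after the default has failed.
        return heuristic(data)
```

Note that unreliable metadata is ignored outright rather than being used as
a hint, which is one reading of "it's better to assume 0)".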

Larry
