On Mon, 16-08-2004, at 18:56 +0200, Marcin 'Qrczak' Kowalczyk wrote:
> There are also two models how a Perl script may operate, which should
> better not be mixed in one program:
>
> A. The old model: it tries to work on the original encoding of the data.
>    Uses non-UTF-8 scalars exclusively if the encoding is a byte encoding
>    other than UTF-8, uses UTF-8 scalars if it's UTF-8, some things break
>    for multibyte encodings other than UTF-8 (e.g. regexps).
>
> B. The new model: it uses Unicode internally, which is physically
>    represented by non-UTF-8 scalars if it happens to fit in ISO-8859-1
>    and by UTF-8 scalars otherwise.

In my Kogut<->Perl bridge I would like to use Perl in the B model,
because I was told here that non-UTF-8 scalars are interpreted according
to the B model when they are mixed with UTF-8 scalars, so this is the
only model which makes sense when Unicode is used. It's also more
convenient for me because Kogut strings are Unicode internally.

What Perl interpreter invocation arguments are necessary to make this
work? See below for what I mean by "work".

The following places exchange text with the external world, encoding
characters as bytes, without an explicit encoding specified by the
protocol. They should therefore either use the encoding of my choice,
which I will put somewhere in the invocation arguments (and which will
usually be the default encoding of the locale), or use the default
encoding of the locale themselves. Either of these is fine with me:

- file contents, including stdin/stdout/stderr and sockets, unless
  overridden explicitly
- filenames (including functions like mkdir, stat, glob)
- arguments of system and exec
- @ARGV
- %ENV
- $! when it contains the result of strerror()
- and probably other similar things I've forgotten.

There are also places in the Perl API which use Perl scalars. They
should always interpret them according to the B model, i.e. a scalar
with the UTF-8 flag turned off is interpreted as ISO-8859-1.

There are also places which don't have to support more than ASCII, but
it would be nice if they had an official interpretation of non-ASCII
characters (either the locale encoding or ISO-8859-1, I suppose), so
that I know how to convert Unicode strings for them:

- variable and package names (get_sv, gv_stashpv)

The encoding of the script source should be specified separately from
everything else, because it depends on how the script has been written,
while the others depend on where it is run. In the case of my
Kogut<->Perl bridge there is no such thing as script source (the
interpreter is invoked with -e ""). But code might call eval_sv among
other things, and its argument, being a Perl scalar, should be
interpreted as above.

Note: with these options:

    -Mencoding=$ENCODING -Mopen=:encoding($ENCODING)

file contents are recoded correctly, but all other things are broken,
including eval_sv, which interprets non-UTF-8 strings according to the
locale. OTOH with this option:

    -Mopen=:encoding($ENCODING)

eval_sv works, but stdin/stdout/stderr are not recoded.

Note: $! has the same weird behavior as @ARGV:

    $ perl -Mencoding=ISO-8859-2 -Mopen=:encoding\(ISO-8859-2\) -e '
    eval {open F, "/etc/shadow"}; print "$!\n"'
    Brak dostępu
    $ perl -Mencoding=ISO-8859-2 -Mopen=:encoding\(ISO-8859-2\) -e '
    eval {open F, "/etc/shadow"}; print substr($!, 0), "\n"'
    "\x{00ea}" does not map to iso-8859-2 at -e line 1.
    Brak dost\x{00ea}pu

-- 
   __("<         Marcin Kowalczyk
   \__/       [EMAIL PROTECTED]
    ^^     http://qrnik.knm.org.pl/~qrczak/