Porters,

In recent discussions on several Japanese Perl mailing lists, I discovered that the encoding pragma does not work with multibyte encodings such as Shift_JIS, whose second byte can fall in the 0x00-0x7f (ASCII) range. Though I have not tested it, I am fairly sure Big5 is prone to this as well.

To understand the problem, please have a look at the hexdump below:

% hexdump -C enc-sjis.pl
00000000 23 2f 75 73 72 2f 6c 6f 63 61 6c 2f 62 69 6e 2f |#/usr/local/bin/|
00000010 70 65 72 6c 20 2d 77 0a 75 73 65 20 73 74 72 69 |perl -w.use stri|
00000020 63 74 3b 0a 75 73 65 20 65 6e 63 6f 64 69 6e 67 |ct;.use encoding|
00000030 20 27 73 68 69 66 74 2d 6a 69 73 27 3b 0a 0a 6d | 'shift-jis';..m|
00000040 79 20 24 6e 61 6d 65 20 3d 20 22 94 5c 22 3b 0a |y $name = ".\";.|
00000050 70 72 69 6e 74 20 24 6e 61 6d 65 3b 0a 77 72 69 |print $name;.wri|
00000060 74 65 3b 0a 0a 66 6f 72 6d 61 74 20 53 54 44 4f |te;..format STDO|
00000070 55 54 20 3d 0a 94 5c 97 cd 3a 40 3c 3c 3c 0a 24 |UT =..\..:@<<<.$|
00000080 6e 61 6d 65 0a 2e 0a |name...|

The script is valid Perl in Shift_JIS, but the quoted character (U+80FD, \x94\x5c in Shift_JIS) has \x5c, the ASCII backslash, as its second byte, and that mangles the script: the parser sees the backslash escaping the closing quote. In other words, the encoding pragma needs the source to be parsable ASCII-wise.
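
The clash can be checked without any Shift_JIS source file. The following is only a minimal sketch, assuming the stock Encode module with its 'shiftjis' encoding; the variable names are mine and appear nowhere in encoding.pm.

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+80FD is the quoted character from the script in the hexdump.
my $bytes = encode('shiftjis', "\x{80fd}");

# Show each byte of the Shift_JIS form in hex.
my $hex = join ' ', map { sprintf('%02x', ord($_)) } split //, $bytes;
print "$hex\n";    # prints "94 5c"

# The trail byte is the very same byte as an ASCII backslash, so a
# parser reading the source byte by byte takes it as an escape.
my $trail_is_backslash = substr($bytes, 1, 1) eq '\\' ? 1 : 0;
print "trail byte is a backslash: $trail_is_backslash\n";
```

Any double-byte character whose trail byte is 0x5c would presumably trip the parser the same way when it sits right before a closing quote.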

Fortunately, the encoding pragma offers a different approach via Filter=>1. The problem is that the Filter option has been incomplete in two ways.

0. Filter=>1 leaves STD(IN|OUT) untouched. Not only that, it completely ignores the STD*=> hooks that the non-filter version offers.

1. To touch STD(IN|OUT) sensibly, you have to 'use utf8' in the script to make sure its literals are utf8-flagged, but that makes the code too counterintuitive.
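
To illustrate why the utf8 flag matters in item 1, here is a rough sketch, again assuming only the stock Encode module. It uses encode() as a stand-in for what an :encoding layer does on output; the variable names are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# UTF-8 bytes of U+80FD, as a literal looks once the source filter
# has converted the script to UTF-8 but without the utf8 flag set.
my $octets = "\xe8\x83\xbd";

# The same literal treated as characters, which is what 'use utf8'
# arranges for literals in the script.
my $chars = decode('UTF-8', $octets);

printf "octets: length %d, flagged %d\n",
    length($octets), utf8::is_utf8($octets) ? 1 : 0;
printf "chars:  length %d, flagged %d\n",
    length($chars), utf8::is_utf8($chars) ? 1 : 0;

# Encoding the flagged string gives back the Shift_JIS bytes.
# Encoding the unflagged octets treats each byte as its own
# character, so the result is something else entirely.
my $good = unpack 'H*', encode('shiftjis', $chars);
my $bad  = unpack 'H*', encode('shiftjis', $octets);
print "flagged:   $good\n";
print "unflagged: $bad\n";
```

Only the flagged string round-trips to 94 5c, which is why the literals need the flag before STDOUT can be given an encoding layer.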

The following patch fixes both, making the Filter option more useful. I am planning to apply it to the next version of Encode, but I still need to fix the POD and write tests. So I decided to issue a warning before committing a release.

Dan the Encode Maintainer

--- encoding.pm 2003/01/22 03:29:07 1.40
+++ encoding.pm 2003/01/26 07:03:59
@@ -35,33 +35,11 @@
     unless ($arg{Filter}) {
         ${^ENCODING} = $enc unless $] <= 5.008 and $utfs{$name};
         $HAS_PERLIO or return 1;
-        for my $h (qw(STDIN STDOUT)){
-            if ($arg{$h}){
-                unless (defined find_encoding($arg{$h})) {
-                    require Carp;
-                    Carp::croak("Unknown encoding for $h, '$arg{$h}'");
-                }
-                eval { binmode($h, ":encoding($arg{$h})") };
-            }else{
-                unless (exists $arg{$h}){
-                    eval {
-                        no warnings 'uninitialized';
-                        binmode($h, ":encoding($name)");
-                    };
-                }
-            }
-            if ($@){
-                require Carp;
-                Carp::croak($@);
-            }
-        }
     }else{
         defined(${^ENCODING}) and undef ${^ENCODING};
         eval {
             require Filter::Util::Call ;
             Filter::Util::Call->import ;
-            binmode(STDIN);
-            binmode(STDOUT);
             filter_add(sub{
                 my $status;
                 if (($status = filter_read()) > 0){
@@ -71,7 +49,31 @@
                 $status ;
             });
         };
+        # internally use utf8 to make sure utf8 flags are set
+        # for literals.
+        use utf8 (); # to fetch $utf8::hint_bits;
+        $^H |= $utf8::hint_bits;
         # warn "Filter installed";
+    }
+    for my $h (qw(STDIN STDOUT)){
+        if ($arg{$h}){
+            unless (defined find_encoding($arg{$h})) {
+                require Carp;
+                Carp::croak("Unknown encoding for $h, '$arg{$h}'");
+            }
+            eval { binmode($h, ":encoding($arg{$h})") };
+        }else{
+            unless (exists $arg{$h}){
+                eval {
+                    no warnings 'uninitialized';
+                    binmode($h, ":encoding($name)");
+                };
+            }
+        }
+        if ($@){
+            require Carp;
+            Carp::croak($@);
+        }
     }
     return 1; # I doubt if we need it, though
 }
