For the record --

> Is UTF-8 input coming from the likes of Apache a possible source of
> failure? Pack may need to allow for endian-ness of a specific machine.

Well, it depends on how you look at it. I think one probable reason
the DWIM machinery failed is that I insist on using Shift-JIS rather
than UTF-8 for strings and comments in the source file. But, no,
Apache wasn't converting Shift-JIS to UTF-8 for me, and byte order
was not the problem either.

After several hours of analysis (using more of the stuff that made
the original posting of the source somewhat opaque), I determined
that the problem came from perl sometimes being stricter about
Shift-JIS than I wanted it to be.

I don't know why translating '+' back to a space would switch perl to
strict character interpretation, but it seems to have been doing so.

Shift-JIS is a variable-width encoding: each character is one or two
bytes. Lead bytes are inherently not valid as single-byte characters.
Trailing bytes are sometimes valid as single-byte characters and
sometimes not. If the regular expression engine is not checking for
valid bytes, all you have to do is string the decoded bytes together.
But if it is checking for valid bytes, you have to put the decoded
bytes into something other than a char. (Blame C for folding the type
of a byte onto the type of a character.)

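A minimal sketch of those byte classes, assuming the commonly cited
Shift-JIS ranges (lead bytes 0x81-0x9F and 0xE0-0xFC, trailing bytes
0x40-0x7E and 0x80-0xFC). The sub names are made up for illustration
and are not part of the script posted here:

```perl
use strict;
use warnings;

# Hypothetical helpers. The ranges are the commonly cited Shift-JIS ones,
# an assumption of this sketch rather than something checked against a
# particular JIS revision.
sub is_sjis_lead
{   my ( $byte ) = @_;
    return ( ( $byte >= 0x81 && $byte <= 0x9f )
          || ( $byte >= 0xe0 && $byte <= 0xfc ) ) ? 1 : 0;
}

sub is_sjis_trail
{   my ( $byte ) = @_;
    return ( ( $byte >= 0x40 && $byte <= 0x7e )
          || ( $byte >= 0x80 && $byte <= 0xfc ) ) ? 1 : 0;
}

# 0x41 ('A') is a valid trailing byte and also a valid single-byte
# character, which is why a byte cannot be classified in isolation.
printf "%02x lead=%d trail=%d\n", $_, is_sjis_lead( $_ ), is_sjis_trail( $_ )
    for ( 0x41, 0x7f, 0x83, 0xb1 );
```

Whether a given byte is a character or half of one depends on what
came before it, so any decoder has to carry state from the previous
byte.
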
But if you are collecting into 16-bit words, you have to actually
check for the lead bytes yourself. I'm sure someone could put an RE
together that would do it, but I decided it would be simpler to
check and build the string by hand.

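For comparison, here is a rough sketch of what such an RE might look
like. It assumes the usual lead-byte ranges (0x81-0x9F, 0xE0-0xFC) and
trailing-byte ranges (0x40-0x7E, 0x80-0xFC); the sub name sjis_chars
is hypothetical, and the sketch has not been tried against the
strictness described above:

```perl
use strict;
use warnings;

# Hypothetical helper: percent-decode a query value to raw bytes, then
# split the bytes into Shift-JIS characters with one regular expression.
sub sjis_chars
{   my ( $value ) = @_;
    $value =~ tr/+/ /;
    $value =~ s/%([0-9A-Fa-f]{2})/chr( hex( $1 ) )/eg;
    # Try a two-byte character (lead + trail) first, else take one byte.
    # A lead byte with no valid trailing byte falls through as a single byte.
    return $value =~ m/([\x81-\x9f\xe0-\xfc][\x40-\x7e\x80-\xfc]|[\x00-\xff])/g;
}

# Three two-byte katakana (the last byte of each arrives unescaped),
# then a space and an 'A'.
my @chars = sjis_chars( '%83e%83X%83g+A' );
print join( ' ', map { unpack( 'H*', $_ ) } @chars ), "\n";
```

The alternation order matters: the two-byte alternative has to come
first, or every lead byte would match as a single byte.
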
So, for anybody who's curious, here's what I'm doing for now:
-----------------------------------------
my $qString = $ENV{'QUERY_STRING'};
my @list = split( '&', $qString, 10 );
my %queries = ();
foreach my $pair ( @list )
{   my ( $key, $value ) = split( '=', $pair, 2 );
    # Really should just give in and use CGI.
    # $key =~ tr/+/ /; # You don't expect space in identifiers, but, ...
    $key =~ s/%([\dA-Fa-f][\dA-Fa-f])/pack ("C", hex ($1))/eg;
    # $queries{ $key . '_' } = $value; # dbg
    $value =~ tr/+/ /;
    my ( $byteAccm, $hexAccm, $conv ) = ( 0, undef, '' );
    while ( $value =~ m/%([\dA-Fa-f][\dA-Fa-f])|(.)/g )
    {   if ( defined ( $1 ) )
        {   my $hexValue = $1;
            my $decValue = hex ( $hexValue );
            if ( ! defined ( $hexAccm ) )
            {   if ( $decValue <= 0x80
                     || ( $decValue >= 0xa0 && $decValue < 0xe0 )
                     || $decValue >= 0xfd )
                {   $conv .= pack( 'C', $decValue );    # Single-byte character.
                }
                else # Lead byte -- loose checks all around.
                {   $byteAccm = $decValue;
                    $hexAccm = $hexValue;
                }
            }
            else
            {   # if ( $decValue >= 0x40 || ( $decValue > 0xa0 && $decValue < 0xe0 ) )
                # 'n' is big-endian, so the lead byte comes first on any machine
                # ('S' is native order and only works on big-endian hardware).
                $conv .= pack( 'n', ( $byteAccm << 8 ) + $decValue );
                $byteAccm = 0;
                $hexAccm = undef;
            }
        }
        else
        {   my $cValue = $2;
            my $decValue = ord ( $cValue );
            if ( ! defined ( $hexAccm ) )
            {   $conv .= $cValue;
            }
            else
            {   # if ( $decValue >= 0x40 || ( $decValue > 0xa0 && $decValue < 0xe0 ) )
                # Unescaped trailing byte after an escaped lead byte.
                $conv .= pack( 'n', ( $byteAccm << 8 ) + $decValue );
                $byteAccm = 0;
                $hexAccm = undef;
            }
        }
    }
    $queries{ $key } = $conv;
}
-----------------------------------------
If this were production code, I would need to check for more gaps in
the lead byte range (and find out where the newest JIS adds its extra
several thousand characters), uncomment the checks on the trailing
bytes, and add some trailing byte checks specific to certain lead
bytes (geagh). But then I would have to figure out what to do with
bad bytes.

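One possible shape for the trailing byte check, purely as a sketch.
The sub name sjis_pair, the trailing-byte ranges (0x40-0x7E and
0x80-0xFC), and the policy of substituting '?' for a bad pair are all
assumptions made for illustration, not a decision about what the
script should do:

```perl
use strict;
use warnings;

# Hypothetical helper: join a lead byte and a trailing byte into a
# two-byte string, or return a replacement when the trailing byte is out
# of range. Substituting '?' is only one possible policy, assumed here.
sub sjis_pair
{   my ( $lead, $trail, $replacement ) = @_;
    $replacement = '?' unless defined $replacement;
    my $trailOk = ( $trail >= 0x40 && $trail <= 0x7e )
               || ( $trail >= 0x80 && $trail <= 0xfc );
    return $trailOk ? pack( 'CC', $lead, $trail ) : $replacement;
}

print unpack( 'H*', sjis_pair( 0x83, 0x65 ) ), "\n";    # Good pair.
print sjis_pair( 0x83, 0x20 ), "\n";                    # Bad trailing byte.
```

Other policies (dropping the pair, passing both bytes through
untouched, or rejecting the whole request) would fit the same spot;
which one is right depends on what consumes the decoded string.
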
Joel Rees
(waiting for a 3+GHz ARM processor to come out,
to test Steve's willingness to switch again.)