Re: encoding neutral unpack

Ton Hospel Sat, 29 Jan 2005 12:03:40 -0800

In article <[EMAIL PROTECTED]>,
        Rafael Garcia-Suarez <[EMAIL PROTECTED]> writes:


Before replying let me first talk a bit more about what "U0" and "C0" mode
mean and the philosophy of the patch.

And before doing that let me first point out that we currently have a 
problem even without trying to make unpack encoding neutral.

Take the following three statements:

1) What pack in current perl does for C0 is basically ok:
perl -wle 'use Devel::Peek; Dump(pack("C0U*", 8188))'
SV = PV(0x8162464) at 0x817547c
  REFCNT = 1
  FLAGS = (PADTMP,POK,pPOK)
  PV = 0x816b750 "\341\277\274"\0
  CUR = 3
  LEN = 14
2) What unpack currently does for C0 is basically ok:
perl -wle 'print for unpack("C0U*", "\341\277\274")'
225
191
188
3) pack and unpack should basically be each other's reverse
   (see how the two above are not)

So obviously you MUST drop (at least) one of these statements.

Notice that there is no upgrading/downgrading or so going on here,
I'm not invoking any of my opinions about unpack neutrality.

So which one to drop/change ? I don't want to change 1). It's a sane 
behaviour and it's in fact exactly the one documented. I also don't want 
to change 3). I think reversibility should be a basic design criterion 
for pack/unpack. 2) is what is left.

To my mind what current perl unpack does is simply a bug.
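
A minimal sketch of the round trip that statement 3 asks for: pack a list with a template, unpack the result with the same template, and compare. The comments describe what a perl with the unpack semantics argued for here should do; it is an assumption that the perl running this implements them, and a perl showing the behaviour in statement 2 will not round-trip the "C0U*" case.

```perl
use strict;
use warnings;

# Round-trip check: unpack(T, pack(T, LIST)) should give LIST back.
# Assumption: unpack treats C0/U0 the same way pack does.
my @in = (8188);    # the code point used in the Dump output above

for my $template ("U*", "C0U*") {
    my $packed = pack($template, @in);
    my @out    = unpack($template, $packed);
    printf "%-5s packed length=%d  in=(%s)  out=(%s)\n",
        $template, length($packed), "@in", "@out";
}
```

If pack and unpack really are each other's reverse, the in and out lists match for both templates.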

Another reason to pick statement 2 is that in fact C0 and U0 mode for 
unpack were never documented. 

So what should U0 and C0 mean ?
Consider the section from the pack docs:

       *       If the pattern begins with a "U", the resulting string
               will be treated as UTF-8-encoded Unicode. You can force
               UTF-8 encoding on in a string with an initial "U0", and
               the bytes that follow will be interpreted as Unicode
               characters. If you don't want this to happen, you can
               begin your pattern with "C0" (or anything else) to
               force Perl not to UTF-8 encode your string, and then
               follow this with a "U*" somewhere in your pattern.

So clearly "C0" mode is meant as the "normal" mode. The packed string 
is simply a string of bytes, and you pack or unpack byte by byte.

Now I get to the wish that a string getting upgraded (which can easily
happen by accident in the bright new unicode world) should not change its
meaning for unpack. If a string gets upgraded, the thing that used to be
a byte is now a character (a possibly multibyte UTF8 sequence internally).

So in normal (C0) mode, an upgraded string should be processed with 
character semantics.
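
To make the wish concrete, here is a small sketch that unpacks the same three characters twice, once from a plain byte string and once from an upgraded copy. It assumes a perl where unpack in normal (C0) mode follows character semantics as described; on a perl that looks at the internal bytes instead, the second line will differ.

```perl
use strict;
use warnings;

# The same three characters, once as a byte string and once upgraded.
my $plain    = "\340bc";
my $upgraded = $plain;
utf8::upgrade($upgraded);    # changes the internal encoding only

# Perl itself considers the two strings equal ...
print "eq: ", ($plain eq $upgraded ? "yes" : "no"), "\n";

# ... so in normal (C0) mode unpack should report the same values for both.
# With character semantics both lines print "224 98 99".
print join(" ", unpack("C*", $plain)),    "\n";
print join(" ", unpack("C*", $upgraded)), "\n";
```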

Next we get to "U" at the start, a behaviour you can also cause by putting 
U0 at the start. It means that the string gets built byte by byte, but in
the end the UTF8 flag is turned on without upgrading. From the perl point
of view the string *IS* an upgraded string (but the character sequence
it represents is not (necessarily) the byte values you used to pack it)

So in U0 mode, an upgraded string (string with the utf8-flag on) is packed
and should be unpacked using byte semantics.

pack/unpack is also popularly used to write packed data to 
sockets/old-style files, which will later get processed. If an upgraded 
string is written to such a byte resource, perl will try to downgrade it.
If it succeeds in doing that, and the user starts to unpack it, it would
be nice if he still got the same stuff as he packed before, even in U0 mode.

This is in fact perfectly feasible by processing the packed string by 
(conceptually) upgrading it and then processing it as bytes.

So for an encoding neutral pack/unpack pair, think of the packed string 
in its corresponding upgraded (utf8-expanded) form. C0 and U0 mode are 
then easily understood:

C0 mode: process per character
U0 mode: process per underlying byte

This formulation is trivially encoding neutral since it only thinks
in terms of the upgraded string.
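
The two modes can be compared side by side on a single character. This is a sketch under the assumption that unpack implements the formulation above; the expected values in the comments follow from the UTF-8 encoding of character 224 (bytes 195 and 160).

```perl
use strict;
use warnings;

# One character with value 224, in both internal encodings.
my $plain    = "\340";
my $upgraded = $plain;
utf8::upgrade($upgraded);

for my $s ($plain, $upgraded) {
    my @per_char = unpack("C0C*", $s);    # C0 mode: process per character
    my @per_byte = unpack("U0C*", $s);    # U0 mode: per byte of the upgraded form
    # Expected for both strings: C0: 224   U0: 195 160
    print "C0: @per_char   U0: @per_byte\n";
}
```

Both iterations should print the same line, which is the whole point of the formulation being encoding neutral.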

As an implementation optimization for C0 mode we don't actually upgrade
the string if it doesn't have the utf8-flag, but leave it as it is and process
it as a byte sequence (which would be equivalent to a character sequence
on the upgraded form). So you see that this is perfectly compatible with
pre-unicode pack/unpack which behaves exactly like that.

I have the impression that many people (including whoever wrote the current
perl unpack code for U0 and C0) think of C0 mode as byte-mode and U0 mode
as unicode mode and this is where the confusion comes from. To be compatible
with what pack actually does and implements, think of C0 mode as 
"character mode" and of U0 mode as "utf8-mode" (where you get to see the bytes
that make up the UTF8-expansion).

Apart from the meaning of U0 and C0, there's also the question of in which
mode we start. That's also pretty clear from the documentation of pack:

If the packformat starts with U, you start in U0 mode, in all other cases you
start in C0 mode.
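
The starting-mode rule can be observed from the pack side. The sketch below only inspects the result of pack: a format starting with U yields a string with the utf8 flag on, while a format starting with C0 yields plain bytes, matching the Dump output quoted earlier (CUR = 3, no UTF8 flag).

```perl
use strict;
use warnings;

my $u = pack("U*",   8188);    # starts with U  -> U0 mode
my $c = pack("C0U*", 8188);    # starts with C0 -> C0 mode

# Expected: "U*" gives flag on, length 1; "C0U*" gives flag off, length 3.
printf "%-5s utf8 flag %-3s length %d\n",
    "U*",   (utf8::is_utf8($u) ? "on" : "off"), length $u;
printf "%-5s utf8 flag %-3s length %d\n",
    "C0U*", (utf8::is_utf8($c) ? "on" : "off"), length $c;
```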

This interpretation of the world is exactly what my patch implemented.

(Except that in the meantime I realize what I did for "U" in C0 mode
was wrong. If I succeed in convincing people that what I'm aiming for is the
right thing I'll send the updated patch. Let me just state that the fix 
doesn't change any of the patches that are needed outside pp_pack.c)

Ok, now to my answer to Rafael:

> As Nicholas is working on pp_pack currently, your patches will need
> adaptations.

Mm, as far as I saw, what Nicholas did is purely orthogonal. But if I succeed 
in convincing people what I do is right I'll make a new patch.

> Moreover it touches Encode, which is supposed to work on
> 5.8.x perls.

In fact of the Encode code only ext/Encode/lib/Encode/MIME/Header.pm 
(the one in lib/Encode/MIME/Header.pm comes directly from that during the
build) has to change, and in only one place:

--- perl-clean/ext/Encode/lib/Encode/MIME/Header.pm      Thu May 20 15:51:06 2004
+++ perl-dev/ext/Encode/lib/Encode/MIME/Header.pm        Fri Jan 21 23:08:12 2005
-                   join("" => map {sprintf "=%02X", $_} unpack("C*", $1))
+                   join("" => map {sprintf "=%02X", $_} unpack("U0C*", $1))

That one is unavoidable if you want to make unpack encoding neutral since
it uses C* to see "through" the string encoding. By switching to U0 mode
it is guaranteed to get the utf-8 bytes making up the encoding of $1.
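
A standalone sketch of the changed expression follows. The helper name qp_bytes is hypothetical and is not part of Encode::MIME::Header; it only wraps the join/map/unpack expression from the diff so the effect of "U0C*" can be seen on strings in different internal encodings. It assumes an unpack with the U0 semantics described in this mail.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the expression from the diff above.
sub qp_bytes {
    my ($text) = @_;
    return join("" => map { sprintf "=%02X", $_ } unpack("U0C*", $text));
}

my $plain    = "\351";    # character 233 as a plain byte string
my $upgraded = $plain;
utf8::upgrade($upgraded);

# U0 mode yields the UTF-8 bytes regardless of the internal encoding.
print qp_bytes($plain),     "\n";    # expected: =C3=A9
print qp_bytes($upgraded),  "\n";    # expected: =C3=A9
print qp_bytes(chr(8188)),  "\n";    # expected: =E1=BF=BC
```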

It's easy to make this conditional on the unpack behaviour, so it's
hardly a show stopper (actually, testing on my 5.8.6, unconditionally adding 
the U0 there even *without* my patch passes all tests). And of course I
wasn't proposing my patches for perl 5.8 anyways, the patch was relative
to blead.

There is also one test in ext/Encode/t/encoding.t where again unpack
is used to see "through" the encoding. So that also had to be fixed.

> 
> Further comments below.
> 
>> Things to note:
>>   - The trick of using unpack("C*", $string) to see "through" the encoding
>>     doesn't work anymore 
>>     (which is as it should be if unpack is encoding-neutral)
>>     You can run unpack under "use bytes" to get the old effect though.
> 
> What about forcing with C0 ?

C0 *IS* the default mode, so it's already what's happening. Nor will U0C*
see "through" the encoding. Instead, it will tell you which bytes you need
for the UTF-8 encoding of the argument (irrespective of the encoding the
argument happens to have).

Seeing "through" the encoding is basically incompatible with being
encoding neutral. It would need a third mode beyond U0 and C0 (both of 
which have an encoding neutral meaning now). It's definitely possible, 
but I don't really see the need since we already have "use bytes" for
this.
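
For completeness, a sketch of seeing "through" the encoding with "use bytes". The helper name raw_bytes is hypothetical. Unlike the C0 and U0 modes, the result here depends on the internal encoding of the argument, which is exactly why it is not encoding neutral.

```perl
use strict;
use warnings;

# Hypothetical helper: unpack under "use bytes" looks at the internal bytes.
sub raw_bytes {
    my ($s) = @_;
    use bytes;    # lexically scoped to this sub
    return unpack("C*", $s);
}

my $plain    = "\340";
my $upgraded = $plain;
utf8::upgrade($upgraded);

print join(" ", raw_bytes($plain)),    "\n";    # expected: 224
print join(" ", raw_bytes($upgraded)), "\n";    # expected: 195 160
```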

----
I snipped your comments about not wanting to change the meaning of
C0 and U0 in unpack here. I hope I addressed them above and convinced
you that the current behaviour is a bug and by reading C0 and U0 as 
"character mode" and "utf8 mode" my proposed behaviour is perfectly 
natural.
----
>>   - For the moment I made unpack "C" on a char >= 256 be an error, though
>>     later on I'd like to make it just basically do ord(). C is however
>>     the most likely format that users will notice the changed semantics, so 
>>     for a first version it might be good to keep it like this so users can
>>     flush out errors. (since A, a and Z checksumming is currently delegated
>>     to "C", that also implies that these won't checksum chars >= 256)
> 
>>   - Fixed a minor bug in that checksumming over a partial byte in 'B' formats
>>     forgot to advance the pointer (needs to be applied to maint too)
> 
> Submit it separately.

Done.

> 
>>   - U0 or starting with U on an octet string is completely yucky to implement
>>     as a special case in all formats, so I do these by upgrading a temporary 
>>     string. This is inefficient if you only unpack few things from a huge
>>     string (fortunately this case shouldn't be normal. pack with an initial
>>     U would return an already upgraded string, so this would only happen if
>>     the string later got somehow degraded). Still might be worth warning
>>     about in the docs.
> 
> OK.
> 
>>   - If the first character in the unpackstring is U it now behaves like
>>     there is an U0 before (this is consistent with the documentation
>>     and behaviour of pack, and needed for reversibility)
> 
> This seems to contradict the following :

Doesn't contradict it, but I was unclear about which mode things START in.
If the packstring starts with U, processing starts in U0 mode, otherwise
in C0 mode.

> 
>>   - Only literal C0 and U0 cause a mode switch, not ones implied by
>>     something like: unpack("C/U", "\x00")
>>     (I consider that a bugfix that should also be applied to maint)
> 
> Sounds likely; submit it separately.

Done

> 
>>   - C0 and U0 modes are scoped to (), so in format "s(nU0v)2S", the U0 mode
>>     only applies to the v, NOT to the S or the n (in the second round)
>>     (I also consider the old behaviour here a bug. It made multiround
>>     groups and C/() style groups too unpredictable)
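
A small sketch of the group scoping, using a simpler format than the one in the quoted item. It assumes an unpack where a U0 inside parentheses ends with the group, so each round of the group starts again in the enclosing (C0) mode. The expected values follow from the UTF-8 encodings of characters 225 (bytes 195 161) and 227 (bytes 195 163).

```perl
use strict;
use warnings;

# Four characters: 224, 225, 226, 227.
my $s = "\340\341\342\343";

# Each round: one value in C0 mode, then U0 and two values in U0 mode.
# If U0 is scoped to the group, the second round starts in C0 mode again.
my @v = unpack("(CU0C2)2", $s);

# Expected: 224 195 161 226 195 163
print "@v\n";
```

If the U0 leaked out of the group, the first value of the second round would be a UTF-8 byte instead of the character value 226.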
> 
> Right. I don't know whether this is feasible to submit it separately,
> though.

Was easy in fact. Submitted.

> 
>>   - a,A,Z now conserve unicodeness when extracting from an unicode packed 
>>     string (can't happen (currently) if the string was constructed with
>>     pack).
>>     e.g.:
>>
>>     ./perl -Ilib -wle 'use Devel::Peek; Dump(unpack("a*", pack("U*", 8188)))'
>>     SV = PV(0x8162ef8) at 0x8162b00
>>     REFCNT = 1
>>     FLAGS = (TEMP,POK,pPOK,UTF8)
>>     PV = 0x8167bf0 "\341\277\274"\0 [UTF8 "\x{1ffc}"]
>>     CUR = 3
>>     LEN = 4
>>
>>     It could be argued that I should try to do sv_utf8_downgrade(sv, 1)
>>     before returning, but I prefer the current code. It allows proper 
>>     fixed format parsing of unicode (and byte) strings.
> 
> How so ? why has it changed ? a side effect of the changes to U formats ?

It's about using unpack to do fixed format processing of strings. I've 
quite often seen it advised on e.g. perlmonks. Like this:

perl -wle '$_ = "\340bcdef"; print for unpack("a2a3", $_)'
àb
cde

Looks nice, but doesn't actually work when unicode contamination enters 
the picture:

perl -wle '$_ = "\340bcdef"; utf8::upgrade($_); print for unpack("a2a3", $_)'
�
bcd

Whoops.

And it's of course completely hopeless if we have real >= 256 chars:

perl -wle '$_ = chr(8188) . "bcdef"; print for unpack("a2a3", $_)'
�
�bc

In my patch, I properly parse per character instead of by byte, so I made
it work as I think it SHOULD work:

./perl -Ilib -wle '$_ = "\340bcdef"; utf8::upgrade($_); print for unpack("a2a3", $_)'
àb
cde

./perl -Ilib -wle '$_ = chr(8188) . "bcdef"; utf8::upgrade($_); print for unpack("a2a3", $_)'
Wide character in print at -e line 1.
ῼb
cde
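
The same fixed-format parse as a script that prints character values instead of the characters themselves, which avoids the "Wide character" warning. It is a sketch that assumes the per-character behaviour shown just above; on a perl that splits by internal bytes the fields will come out differently.

```perl
use strict;
use warnings;

# A field layout of 2 characters followed by 3 characters.
my $s = chr(8188) . "bcdef";
my @fields = unpack("a2a3", $s);

# Show each field as a list of character values.
# Expected with per-character parsing:
#   field 1: 8188 98
#   field 2: 99 100 101
my $n = 0;
for my $f (@fields) {
    $n++;
    print "field $n: ", join(" ", map { ord } split(//, $f)), "\n";
}
```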

>> Probably this has too many semantics changes to apply it to maint 
>> (though I think that all places where it makes a difference are bugs 
>> really), but I think it would be proper for blead.

The only reason I said this is to avoid catching out users that use the 
current behaviour in a stable perl (probably mainly the trick of using 
C* to see "through" the encoding, U0 and C0 mode for unpack were never
documented). But I think it should be changed for 5.10 since I think the 
current behaviour is a bug.
