Re: encoding neutral unpack

Rafael Garcia-Suarez Sun, 30 Jan 2005 06:40:32 -0800

Ton Hospel wrote:
> Take the following three statements:
> 
> 1) What pack in current perl does for C0 is basically ok:
> perl -wle 'use Devel::Peek; Dump(pack("C0U*", 8188))'
> SV = PV(0x8162464) at 0x817547c
>   REFCNT = 1
>   FLAGS = (PADTMP,POK,pPOK)
>   PV = 0x816b750 "\341\277\274"\0
>   CUR = 3
>   LEN = 14
> 2) What unpack currently does for C0 is basically ok:
> perl -wle 'print for unpack("C0U*", "\341\277\274")'
> 225
> 191
> 188
> 3) pack and unpack should basically be each other's reverse
>    (see how the two above are not)
> 
> So obviously you MUST drop (at least) one of these statements.
> 
> Notice that there is no upgrading/downgrading or so going on here,
> I'm not invoking any of my opinions about unpack neutrality.
> 
> So which one to drop/change ? I don't want to change 1). It's a sane 
> behaviour and it's in fact exactly the one documented. I also don't want 
> to change 3). I think reversibility should be a basic design criterion 
> for pack/unpack. 2) is what is left.


I'd prefer dropping 3), since the pack template mini-language is not
fully reversible anyway. Complex pack templates with groups and
repetitions are not applicable as-is to unpack; and, as perlpacktut
says:

    Be cautioned, however, that not all that has been packed together
    can be neatly unpacked - a very common experience as seasoned
    travellers are likely to confirm.

> Another reason to pick statement 2 is that in fact C0 and U0 mode for 
> unpack were never documented. 
> 
> So what should U0 and C0 mean ?
> Consider the section from the pack docs:
> 
>        *       If the pattern begins with a "U", the resulting string
>                will be treated as UTF-8-encoded Unicode. You can force
>                UTF-8 encoding on in a string with an initial "U0", and
>                the bytes that follow will be interpreted as Unicode
>                characters. If you don't want this to happen, you can
>                begin your pattern with "C0" (or anything else) to
>                force Perl not to UTF-8 encode your string, and then
>                follow this with a "U*" somewhere in your pattern.
> 
> So clearly "C0" mode is meant as the "normal" mode. The packed string 
> is simply a string of bytes, and you pack or unpack byte by byte.

So you're getting the documented meaning of C0 for pack and want to
apply it to (undocumented) unpack. That's where we're disagreeing, since
I'd prefer C0 to force unpack to produce bytes, just like in current
perl.

> Now I get to the wish that a string getting upgraded (which can easily
> happen by accident in the bright new unicode world) should not change its
> meaning for unpack. If a string gets upgraded the thing that used to be
> a byte is now a character ((possibly) multibyte UTF8 sequence internally).
> 
> So in normal (C0) mode, an upgraded string should be processed with 
> character semantics.
...
> So for an encoding neutral pack/unpack pair, think of the packed string 
> in its corresponding upgraded (utf8-expanded) form. C0 and U0 mode are 
> then easily understood:
> 
> C0 mode: process per character
> U0 mode: process per underlying byte
> 
> This formulation is trivially encoding neutral since it only thinks
> in terms of the upgraded string.
> 
> As an implementation optimization for C0 mode we don't actually upgrade
> the string if it doesn't have the utf8-flag, but leave it as it is and process
> it as a byte sequence (which would be equivalent to a character sequence
> on the upgraded form). So you see that this is perfectly compatible with
> pre-unicode pack/unpack which behaves exactly like that.
> 
> I have the impression that many people (including whoever wrote the current
> perl unpack code for U0 and C0) think of C0 mode as byte-mode and U0 mode
> as unicode mode and this is where the confusion comes from. To be compatible
> with what pack actually does and implements, think of C0 mode as 
> "character mode" and of U0 mode as "utf8-mode" (where you get to see the bytes
> that make up the UTF8-expansion).

"character" vs "utf8" is definitively confusing, perhaps "encoding" or
"byte" mode describes U0 better. The perlfunc/pack doc could be improved
on this.

My point on unpack is that the meaning of U0 and C0 should be reversed,
since pack and unpack operate in opposite directions. Otherwise, I agree
with you.
But I have to admit that with the new meanings of C0 and U0 modes you
proposed, your new behaviour of unpack makes more sense. I'd still
prefer to update the docs than to break backwards compatibility (even if
this was not documented properly.)

> Apart from the meaning of U0 and C0, there's also the question of in which
> mode we start. That's also pretty clear from the documentation of pack:
> 
> If the packformat starts with U, you start in U0 mode, in all other cases you
> start in C0 mode.
> 
> This interpretation of the world is exactly what my patch implemented.
...
> > Moreover it touches Encode, which is supposed to work on
> > 5.8.x perls.
> 
> In fact of the Encode code only ext/Encode/lib/Encode/MIME/Header.pm 
> (the one in lib/Encode/MIME/Header.pm comes directly from that during the
> build) has to change, and in only one place:
> 
> --- perl-clean/ext/Encode/lib/Encode/MIME/Header.pm      Thu May 20 15:51:06 2004
> +++ perl-dev/ext/Encode/lib/Encode/MIME/Header.pm        Fri Jan 21 23:08:12 2005
> -                   join("" => map {sprintf "=%02X", $_} unpack("C*", $1))
> +                   join("" => map {sprintf "=%02X", $_} unpack("U0C*", $1))
> 
> That one is unavoidable if you want to make unpack encoding neutral since
> it uses C* to see "through" the string encoding. By switching to U0 mode
> it is guaranteed to get the utf-8 bytes making up the encoding of $1.

Right. That's an example of broken backwards compatibility.

> Seeing "through" the encoding is basically incompatible with being
> encoding neutral. It would need a third mode beyond U0 and C0 (both of 
> which have an encoding neutral meaning now). It's definitely possible, 
> but I don't really see the need since we already have "use bytes" for
> this.

OK.

Thanks for your comments.

(Waiting for the separately submitted patches...)

> >>   - a,A,Z now conserve unicodeness when extracting from a unicode packed 
> >>     string (can't happen (currently) if the string was constructed with
> >>     pack).
> >>     e.g.:
> >>
> >>     ./perl -Ilib -wle 'use Devel::Peek; Dump(unpack("a*", pack("U*", 8188)))'
> >>     SV = PV(0x8162ef8) at 0x8162b00
> >>     REFCNT = 1
> >>     FLAGS = (TEMP,POK,pPOK,UTF8)
> >>     PV = 0x8167bf0 "\341\277\274"\0 [UTF8 "\x{1ffc}"]
> >>     CUR = 3
> >>     LEN = 4
> >>
> >>     It could be argued that I should try to do sv_utf8_downgrade(sv, 1)
> >>     before returning, but I prefer the current code. It allows proper 
> >>     fixed format parsing of unicode (and byte) strings.
> > 
> > How so ? why has it changed ? a side effect of the changes to U formats ?
> 
> It's about using unpack to do fixed format processing of strings. I've 
> quite often seen it advised on e.g. perlmonks. Like this:
> 
> perl -wle '$_ = "àbcdef"; print for unpack("a2a3", $_)'
> àb
> cde
> 
> Looks nice, but doesn't actually work when unicode contamination enters 
> the picture:
> 
> perl -wle '$_ = "\340bcdef"; utf8::upgrade($_); print for unpack("a2a3", $_)'
> �
> bcd
> 
> Whoops.
> 
> And it's of course completely hopeless if we have real >= 256 chars:
> 
> perl -wle '$_ = chr(8188) . "bcdef"; print for unpack("a2a3", $_)'
> �
> �bc
> 
> In my patch, I properly parse per character instead of by byte, so I made
> it work as I think it SHOULD work:
> 
> ./perl -Ilib -wle '$_ = "àbcdef"; utf8::upgrade($_); print for unpack("a2a3", $_)'
> àb
> cde
> 
> ./perl -Ilib -wle '$_ = chr(8188) . "bcdef"; utf8::upgrade($_); print for unpack("a2a3", $_)'
> Wide character in print at -e line 1.
> ῼb
> cde

-- 
He who stealeth from the poor lendeth to the Lord.
    -- Ulysses
