Re: encoding neutral unpack

Ton Hospel Sun, 30 Jan 2005 07:53:50 -0800

In article <[EMAIL PROTECTED]>,
        Rafael Garcia-Suarez <[EMAIL PROTECTED]> writes:
> Ton Hospel wrote:
>> Take the following three statements:
>>
>> 1) What pack in current perl does for C0 is basically ok:
>> 2) What unpack currently does for C0 is basically ok:
>> 3) pack and unpack should basically be each other's reverse
>>    (see how the two above are not)
>>
>> So which one to drop/change ? I don't want to change 1). It's a sane
>> behaviour and it's in fact exactly the one documented. I also don't want
>> to change 3). I think reversibility should be a basic design criterion
>> for pack/unpack. 2) is what is left.
>
> I'd prefer dropping 3), since the pack template mini-language is not
> fully reversible anyway. Complex pack templates with groups and
> repetitions are not applicable as-is to unpack; and, as perlpacktut
> says :
>
>     Be cautioned, however, that not all that has been packed together
>     can be neatly unpacked - a very common experience as seasoned
>     travellers are likely to confirm.
>


Sure, but breaking pack/unpack reversibility here seems rather arbitrary
since we *CAN* have them behave compatibly here. Intentionally even
*reversing* the meaning of C0 and U0 seems particularly strange.

>> Another reason to pick statement 2 is that in fact C0 and U0 mode for
>> unpack were never documented.
>>
>> So what should U0 and C0 mean ?
>> Consider the section from the pack docs:
>>
>>        *       If the pattern begins with a "U", the resulting string
>>                will be treated as UTF-8-encoded Unicode. You can force
>>                UTF-8 encoding on in a string with an initial "U0", and
>>                the bytes that follow will be interpreted as Unicode
>>                characters. If you don't want this to happen, you can
>>                begin your pattern with "C0" (or anything else) to
>>                force Perl not to UTF-8 encode your string, and then
>>                follow this with a "U*" somewhere in your pattern.
>>
>> So clearly "C0" mode is meant as the "normal" mode. The packed string
>> is simply a string of bytes, and you pack or unpack byte by byte.
>
> So you're getting the documented meaning of C0 for pack and want to
> apply it to (undocumented) unpack. That's where we're disagreeing, since
> I'd prefer C0 to force unpack producing bytes, just like in current
> perl.
>
>> So for an encoding neutral pack/unpack pair, think of the packed string
>> in its corresponding upgraded (utf8-expanded) form. C0 and U0 mode are
>> then easily understood:
>>
>> C0 mode: process per character
>> U0 mode: process per underlying byte
>>
>> This formulation is trivially encoding neutral since it only thinks
>> in terms of the upgraded string.
>>
>> I have the impression that many people (including who wrote the current
>> perl unpack code for U0 and C0) think of C0 mode as byte-mode and U0 mode
>> as unicode mode and this is where the confusion comes from. To be compatible
>> with what pack actually does and implements, think of C0 mode as
>> "character mode" and of U0 mode as "utf8-mode" (where you get to see the
>> bytes that make up the UTF8-expansion).
>
> "character" vs "utf8" is definitely confusing, perhaps "encoding" or
> "byte" mode describes U0 better. The perlfunc/pack doc could be improved
> on this.
>
> My point on unpack is that the meaning of U0 and C0 should be reversed,
> since they operate on the opposite sense. Otherwise, I agree with you.
> But I have to admit that with the new meanings of C0 and U0 modes you
> proposed, your new behaviour of unpack makes more sense. I'd still
> prefer to update the docs than to break backwards compatibility (even if
> this was not documented properly.)
>
I can trivially reverse the meaning for U0 and C0 in my patch of course,
and make the default starting mode U0. But it still wouldn't give you what you
want since C0 mode would still work on the (at least conceptually) upgraded
string, and it still wouldn't see "through" the encoding.

I could of course make the (new) C0 be "see through" (dropping the, to my
mind also useful, "process the utf8 bytes" mode), but then you'd STILL not
get unpack("C*", $str) to be the underlying bytes (since we now by default
start in (new) U0 mode), it would have to be unpack("C0C*", $str). So we
can add yet another rule: if the pack format starts with C, we have an
implicit C0, and then unpack("C*", $str) would indeed do what you want.

But we'd have thrown out the baby with the bathwater. Because there is this
basic problem:

- user has some string like "àbc", and he expects unpack("C*", $_) to return
  (224, 98, 99)
1) We want to be encoding neutral, so if the string
   (accidentally) gets upgraded, utf8::upgrade($_); unpack("C*", $_) should
   STILL return (224, 98, 99)
2) We want to be backward compatible, so the upgraded string should return
   the underlying bytes.  utf8::upgrade($_); unpack("C*", $_) should
   return (195, 160, 98, 99)

Notice there was no mention of C0 or U0 modes here. Even so, 1) and 2)
are clearly incompatible.
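
To make 1) concrete, here is a minimal sketch. It assumes a perl where
unpack works per character by default, which is the behaviour my patch
proposes. The variable names are only there for illustration.

```perl
use strict;
use warnings;

# Assumption: unpack processes the string per character by default
# (the behaviour proposed in this thread).
my $str = "\x{e0}bc";              # "àbc", stored as 3 plain bytes

my @before = unpack("C*", $str);   # values before any upgrade

utf8::upgrade($str);               # same string, now stored as UTF-8
my @after = unpack("C*", $str);    # values after the upgrade

print "before upgrade: @before\n";
print "after upgrade:  @after\n";
print "encoding neutral\n" if "@before" eq "@after";
```

Under that assumption both lists are (224, 98, 99). A perl that sees
through the encoding would return the underlying bytes for the second
call instead, which is case 2) above.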
So we'd then have to document that he has to undo the implicit C0 in C* by
doing unpack("U0C*", $_) to get an encoding neutral C*.

To me that makes things more icky than breaking backward compatibility does.
I don't want the user to have to do U0C*, he should just get 1) by default.
Wanting to "see through" the encoding is the non-standard behaviour that
should carry the burden of adding special code.

And deciding that 1) is the right behaviour is enough to need *some*
patches, for example it implies ext/Encode/lib/Encode/MIME/Header.pm needs
a change. Also notice that by writing "U0C*" in these places you get code
that works under both the old perl behaviour and under the behaviour my
patch provides.
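
A small sketch of the "U0C*" idiom, again assuming the behaviour my patch
provides (the variable names are made up for the example). It asks for the
bytes of the utf8 expansion explicitly, so the answer does not depend on
whether the string happens to be upgraded:

```perl
use strict;
use warnings;

# Assumption: U0 in unpack means "process the utf8 expansion byte by byte".
my $down = "\x{e0}bc";             # not upgraded
my $up   = "\x{e0}bc";
utf8::upgrade($up);                # same characters, stored as UTF-8

my @from_down = unpack("U0C*", $down);
my @from_up   = unpack("U0C*", $up);

print "from downgraded: @from_down\n";
print "from upgraded:   @from_up\n";

# pack with the same template should give the original string back.
my $repacked = pack("U0C*", @from_up);
print "round trip ok\n" if $repacked eq $down;
```

Both unpack calls should give (195, 160, 98, 99), and packing those values
with the same template reverses the operation, which is the reversibility
I argued for in statement 3).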

So I basically argue:

 1) "encoding neutral" and "backward compatible (see through)" are
    fundamentally incompatible. And "encoding neutral" is the more
    important one.
 2) We can get "see through encoding" already (and portable to older perls)
    with "use bytes". And in all places it's used to get the utf-8 expansion
    of bytes you can portably use "U0C*" even without "use bytes"
 3) Since "see through" was the main motivation for the current C0 and U0
    meanings anyway, we can just as well change them to the more consistent
    meaning
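
For point 2), a hedged sketch of what "see through" with "use bytes" looks
like. The variable names are illustrative only. Notice that the result
depends on the internal representation, which is exactly what makes it
not encoding neutral:

```perl
use strict;
use warnings;

my $str = "\x{e0}bc";              # "àbc", not upgraded
my (@raw_down, @raw_up);

{
    use bytes;                     # look at the internal bytes
    @raw_down = unpack("C*", $str);
}

utf8::upgrade($str);               # same characters, stored as UTF-8

{
    use bytes;
    @raw_up = unpack("C*", $str);
}

print "downgraded, use bytes: @raw_down\n";
print "upgraded, use bytes:   @raw_up\n";
```

The first list is (224, 98, 99) and the second is (195, 160, 98, 99), so
code that wants the underlying bytes has to ask for them explicitly.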

> (Waiting for the separately submitted patches...)

Mm, they were sent before this mail. They should be on the mailing list
already.
