Re: encoding neutral unpack

Rafael Garcia-Suarez Mon, 31 Jan 2005 08:10:50 -0800

Ton Hospel wrote:
> I can trivially reverse the meaning for U0 and C0 in my patch of course,
> and make the default starting mode U0. But it still wouldn't give you what you
> want since C0 mode would still work on the (at least conceptually) upgraded
> string, and it still wouldn't see "through" the encoding.
> 
> I could of course make the (new) C0 be "see through" (dropping the, to my
> mind, also useful "process the utf8 bytes"), but then you'd STILL not
> get unpack("C*", $str) to be the underlying bytes (since we now by default
> start in (new) U0 mode), it would have to be unpack("C0C*", $str). So we
> can add yet another rule: if the pack format starts with C, we have an
> implicit C0, and then unpack("C*", $str) would indeed do what you want.
> 
> But we'd have thrown out the baby with the bathwater. Because there is this
> basic problem:
> 
> - user has some string like "àbc", and he expects unpack("C*", $_) to return
>   (224, 98, 99)
> 1) We want to be encoding neutral, so if the string
>    (accidentally) gets upgraded, utf8::upgrade($_); unpack("C*", $_) should
>    STILL return (224, 98, 99)
> 2) We want to be backward compatible, so the upgraded string should return
>    the underlying bytes.  utf8::upgrade($_); unpack("C*", $_) should
>    return (195, 160, 98, 99)
> 
> Notice there was no mention of C0 or U0 modes here. Even so, 1) and 2)
> are clearly incompatible.
> So we'd have to document that he has to undo the implicit C0 in C* by doing
> unpack("U0C*", $_) to get an encoding neutral C*


Right.
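
To make the two expectations concrete without depending on which unpack
semantics a given perl implements, here is a minimal sketch that uses only
utf8::upgrade, utf8::encode and ord. The variable names are illustrative and
are not taken from the patch under discussion.

```perl
use strict;
use warnings;

# "\xE0" is a-grave in Latin-1, so this string has three characters.
my $str = "\xE0bc";

# Upgrading changes only the internal storage, not the characters.
my $upgraded = $str;
utf8::upgrade($upgraded);

# Expectation 1 (encoding neutral): the character code points.
my @chars = map { ord } split //, $upgraded;

# Expectation 2 (see through): the UTF-8 bytes of the upgraded form.
my $octets = $upgraded;
utf8::encode($octets);    # $octets is now a plain byte string
my @bytes = map { ord } split //, $octets;

print "chars: @chars\n";  # chars: 224 98 99
print "bytes: @bytes\n";  # bytes: 195 160 98 99
```

The same logical string yields both lists, which is why a single
unpack("C*", $_) cannot satisfy 1) and 2) at once.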

> To me that makes things more icky than breaking backward compatibility does.
> I don't want the user to have to do U0C*, he should just get 1) by default.
> Wanting to "see through" the encoding is the non-standard behaviour that
> should carry the burden of adding special code.

However you're appealing to the Rules of Huffman Coding here.
I'm about to be convinced :) if someone else dares to comment...

> And deciding that 1) is the right behaviour is enough to need *some*
> patches, for example it implies ext/Encode/lib/Encode/MIME/Header.pm needs
> a change. Also notice that by writing "U0C*" in these places you get code
> that works under both the old perl behaviour and under the behaviour my
> patch provides.

OK.

> So I basically argue:
> 
>  1) being "encoding neutral" and "backward compatible (see through)" is
>     fundamentally incompatible. And "encoding neutral" is the more
>     important one.
>  2) We can get "see through encoding" already (and portable to older perls)
>     with "use bytes". And in all places it's used to get the utf-8 expansion
>     of bytes you can portably use "U0C*" even without "use bytes"
>  3) Since "see through" was the main motivation for the current C0 and U0
>     meanings anyways, we can just as well change them to the more consistent
>     meaning
> 
> > (Waiting for the separately submitted patches...)
> 
> Mm, they were sent before this mail. They should be on the mailinglist
> already.

They are on the archive, but not in my mailbox :( apparently we lost a
few mails that day on this side of the internet. I'll look at them soon.

-- 
God wants blood victim. Birth, hymen, martyr, war, foundation of a building,
sacrifice, kidney burntoffering, druids' altars.
    -- Ulysses
