Ton Hospel wrote:
> I can trivially reverse the meaning for U0 and C0 in my patch of course,
> and make the default starting mode U0. But it still wouldn't give you what you
> want since C0 mode would still work on the (at least conceptually) upgraded
> string, and it still wouldn't see "through" the encoding.
>
> I could of course make the (new) C0 be "see through" (dropping the "process
> the utf8 bytes" behaviour, which to my mind is also useful), but then you'd STILL not
> get unpack("C*", $str) to be the underlying bytes (since we now by default
> start in (new) U0 mode), it would have to be unpack("C0C*", $str). So we
> can add yet another rule: if the pack format starts with C, we have an
> implicit C0, and then unpack("C*", $str) would indeed do what you want.
>
> But we'd have thrown out the baby with the bathwater. Because there is this
> basic problem:
>
> - user has some string like "àbc", and he expects unpack("C*", $_) to return
> (224, 98, 99)
> 1) We want to be encoding neutral, so if the string (accidentally) gets
> upgraded, utf8::upgrade($_); unpack("C*", $_) should
> STILL return (224, 98, 99)
> 2) We want to be backward compatible, so the upgraded string should return
> the underlying bytes. utf8::upgrade($_); unpack("C*", $_) should
> return (195, 160, 98, 99)
>
> Notice there was no mention of C0 or U0 modes here. Even so, 1) and 2)
> are clearly incompatible.
> So we'd have to document that he has to undo the implicit C0 in C* by doing
> unpack("U0C*", $_) to get an encoding neutral C*
Right.
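
To make the conflict concrete, here is a minimal sketch of the three cases
involved. The variable names are mine, only unpack, utf8::upgrade and
utf8::encode are core. What the middle case prints is exactly the point under
discussion: rule 1) wants (224, 98, 99), rule 2) wants (195, 160, 98, 99). As
far as I know, perls from 5.10 on follow rule 1).

```perl
use strict;
use warnings;

my $str = "\xe0bc";                  # "àbc": three characters, not upgraded

# Not upgraded: every perl agrees on this one.
my @plain = unpack("C*", $str);

# Same characters, but stored as UTF-8 internally.
my $up = $str;
utf8::upgrade($up);
my @upgraded = unpack("C*", $up);    # the disputed case

# An explicitly encoded byte string: the "underlying bytes" of rule 2).
my $enc = $str;
utf8::encode($enc);
my @bytes = unpack("C*", $enc);

print "plain:    @plain\n";
print "upgraded: @upgraded\n";
print "encoded:  @bytes\n";
```

Note that $str eq $up is true throughout, which is why two different answers
for the same template are hard to defend.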
> To me that makes things more icky than breaking backward compatibility does.
> I don't want the user to have to do U0C*, he should just get 1) by default.
> Wanting to "see through" the encoding is the non-standard behaviour that
> should carry the burden of adding special code.
However you're appealing to the Rules of Huffman Coding here.
I'm about to be convinced :) if someone else dares to comment...
> And deciding that 1) is the right behaviour is enough to need *some*
> patches, for example it implies ext/Encode/lib/Encode/MIME/Header.pm needs
> a change. Also notice that by writing "U0C*" in these places you get code
> that works under both the old perl behaviour and under the behaviour my
> patch provides.
OK.
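
For reference, a small sketch of the "U0C*" idiom as I understand it under
the patched semantics: the template asks for the UTF-8 expansion explicitly,
so the answer should not depend on whether the string happens to be upgraded.
The variable names are illustrative, and I have only reasoned about this for
perls that carry the new behaviour.

```perl
use strict;
use warnings;

my $str = "\xe0bc";      # not upgraded
my $up  = $str;
utf8::upgrade($up);      # same characters, UTF-8 internally

# U0 switches to "process the UTF-8 bytes" mode, C* then reads them one by one.
my @from_plain    = unpack("U0C*", $str);
my @from_upgraded = unpack("U0C*", $up);

print "plain:    @from_plain\n";
print "upgraded: @from_upgraded\n";
```

Both lines should show the same list, the UTF-8 bytes of "àbc".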
> So I basically argue:
>
> 1) being "encoding neutral" and "backward compatible (see through)" are
> fundamentally incompatible. And "encoding neutral" is the more
> important one.
> 2) We can get "see through encoding" already (and portable to older perls)
> with "use bytes". And in all places it's used to get the utf-8 expansion
> of bytes you can portably use "U0C*" even without "use bytes"
> 3) Since "see through" was the main motivation for the current C0 and U0
> meanings anyways, we can just as well change them to the more consistent
> meaning
>
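
On point 2), a sketch of what "see through" via "use bytes" looks like, under
the assumption that unpack honours the pragma as described above. Unlike
"U0C*", the result here depends on the internal representation, which is the
whole idea of seeing through the encoding. Variable names are illustrative.

```perl
use strict;
use warnings;

my $str = "\xe0bc";      # not upgraded
my $up  = $str;
utf8::upgrade($up);      # same characters, UTF-8 internally

my (@seen_plain, @seen_upgraded);
{
    use bytes;           # lexically scoped, affects only this block
    @seen_plain    = unpack("C*", $str);
    @seen_upgraded = unpack("C*", $up);
}

print "plain:    @seen_plain\n";
print "upgraded: @seen_upgraded\n";
```

The two lists differ even though $str eq $up holds, so this belongs only in
code that really wants the internal bytes.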
> > (Waiting for the separately submitted patches...)
>
> Mm, they were sent before this mail. They should be on the mailing list
> already.
They are on the archive, but not in my mailbox :( Apparently we lost a
few mails today on this side of the internet. I'll look at them soon.
--
God wants blood victim. Birth, hymen, martyr, war, foundation of a building,
sacrifice, kidney burntoffering, druids' altars.
-- Ulysses