Re: encoding neutral unpack

Rafael Garcia-Suarez Thu, 27 Jan 2005 15:32:38 -0800

Ton Hospel wrote in perl.perl5.porters :
>
> This patch makes unpack completely independent of whether the string
> happens to be upgraded or not.


As Nicholas is working on pp_pack currently, your patches will need
adaptations. Moreover it touches Encode, which is supposed to work on
5.8.x perls.

Further comments below.

> Things to note:
>   - The trick of using unpack("C*", $string) to see "through" the encoding
>     doesn't work anymore 
>     (which is as it should be if unpack is encoding-neutral)
>     You can run unpack under "use bytes" to get the old effect though.

What about forcing with C0 ?

>   - The meaning of U0 and C0 were sort of swapped for unpack. Consider:
>      with current perl:
>       perl -wle 'print for unpack("U0U*", "\341\277\274")'
>       8188
>       perl -wle 'print for unpack("C0U*", "\341\277\274")'
>       225
>       191
>       188

I don't agree with this. I'm not a heavy pack user, but to me the
current way seems more natural. The semantics you propose for unpack
below look weird (others may care to comment) :

>      Since the pack behaviour looks right and is the one documented, I 
>      conclude that the semantics for C0 and U0 in unpack are wrong (reversed).
>      With the patch they behave like what I think is right:
>
>      ./perl -Ilib -wle 'print for unpack("C0U*", "\341\277\274")'
>      8188
>      ./perl -Ilib -wle 'print for unpack("U0U*", "\341\277\274")'
>      225
>      191
>      188
>
>     Funny enough I only had to change one test for this 
>     (perl-dev/t/uni/case.pl), which is not even a core pack test.

Lack of testing, ungood. In one case or another this would need to be
patched.
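
For readers following the C0/U0 examples, the disagreement is about which of two readings of the same three octets each mode should select. A minimal sketch of those two readings follows, written in Python purely as an illustration; it is not Perl's implementation, and the variable names are made up for the example.

```python
# Illustration only: Python standing in for the two readings of the same data.
# The octets are the ones used in the examples above: "\341\277\274".
octets = b"\341\277\274"  # 0xE1 0xBF 0xBC

# Reading 1: treat the data as raw octets and return one number per byte.
as_octets = list(octets)

# Reading 2: treat the data as UTF-8 and return one number per code point.
as_code_points = [ord(ch) for ch in octets.decode("utf-8")]

print(as_octets)       # [225, 191, 188]
print(as_code_points)  # [8188]
```

The three octets are the UTF-8 encoding of U+1FFC (decimal 8188), which is why the examples print either 8188 or 225, 191, 188 depending on the mode in effect.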

>   - For the moment I made unpack "C" on a char >= 256 be an error, though
>     later on I'd like to make it just basically do ord(). C is however
>     the most likely format that users will notice the changed semantics, so 
>     for a first version it might be good to keep it like this so users can
>     flush out errors. (since A, a and Z checksumming is currently delegated
>     to "C", that also implies that these won't checksum chars >= 256)

>   - Fixed a minor bug in that checksumming over a partial byte in 'B' formats
>     forgot to advance the pointer (needs to be applied to maint too)

Submit it separately.

>   - U0 or starting with U on an octet string is completely yucky to implement
>     as a special case in all formats, so I do these by upgrading a temporary 
>     string. This is inefficient if you only unpack few things from a huge
>     string (fortunately this case shouldn't be normal. pack with an initial
>     U would return an already upgraded string, so this would only happen if
>     the string later got somehow degraded). Still might be worth warning about
>     in the docs.

OK.

>   - If the first character in the unpackstring is U it now behaves like
>     there is an U0 before (this is consistent with the documentation
>     and behaviour of pack, and needed for reversibility)

This seems to contradict the following :

>   - Only literal C0 and U0 cause a mode switch, not ones implied by
>     something like: unpack("C/U", "\x00")
>     (I consider that a bugfix that should also be applied to maint)

Sounds likely; submit it separately.

>   - C0 and U0 modes are scoped to (), so in format "s(nU0v)2S", the U0 mode
>     only applies to the v, NOT to the S or the n (in the second round)
>     (I also consider the old behaviour here a bug. It made multiround
>     groups and C/() style groups too unpredictable)

Right. I don't know whether it is feasible to submit this separately,
though.

>   - a,A,Z now conserve unicodeness when extracting from an unicode packed 
>     string (can't happen (currently) if the string was constructed with
>     pack).
>     e.g.:
>
>     ./perl -Ilib -wle 'use Devel::Peek; Dump(unpack("a*", pack("U*", 8188)))'
>     SV = PV(0x8162ef8) at 0x8162b00
>     REFCNT = 1
>     FLAGS = (TEMP,POK,pPOK,UTF8)
>     PV = 0x8167bf0 "\341\277\274"\0 [UTF8 "\x{1ffc}"]
>     CUR = 3
>     LEN = 4
>
>     It could be argued that I should try to do sv_utf8_downgrade(sv, 1)
>     before returning, but I prefer the current code. It allows proper 
>     fixed format parsing of unicode (and byte) strings.

How so? Why has it changed? Is it a side effect of the changes to the U formats?

>   - "A" stripped both NULL and any whitespace. I change it to NULL and space
>     (documented as such, makes it more reversible from pack and I didn't 
>     want to have to deal with the extended unicode definition of what a 
>     space is)
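
The difference between the two stripping rules for "A" can be sketched as follows. This is Python used only as an illustration, not Perl's code; the helper name is hypothetical, and the ASCII whitespace set standing in for the old "any whitespace" rule is an assumption.

```python
# Illustration only: trailing-character stripping under the two rules.
NUL_AND_SPACE = "\0 "                 # proposed rule: NUL and space only
NUL_AND_WHITESPACE = "\0 \t\n\r\f\v"  # assumed ASCII stand-in for the old rule


def strip_trailing(field: str, chars: str) -> str:
    """Remove any run of the given characters from the end of the field."""
    return field.rstrip(chars)


field = "abc\t \0 "
print(repr(strip_trailing(field, NUL_AND_SPACE)))       # 'abc\t'
print(repr(strip_trailing(field, NUL_AND_WHITESPACE)))  # 'abc'
```

Under the narrower rule a trailing tab that was part of the data survives, which is what makes the result easier to round-trip through a space-padded pack "A".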

>   - Relatively few tests and code in perl itself need to be updated.
>     The ugliest one is in lib/CGI/Util.pm (what Util.pm is doing there
>     is utterly broken anyways, I left it for compatibility. However, what
>     it really should do is utf8::downgrade or use %u escapes I think)
>
> Probably this has too many semantic changes to apply it to maint
> (though I think that all places where it makes a difference are bugs really), 
> but I think it would be proper for blead.

-- 
You probably wouldn't have expected a communist to have a dog named Harpo.
    -- Malcolm Lowry, Under the Volcano
