Ton Hospel wrote in perl.perl5.porters:
>
> This patch makes unpack completely independent of whether the string
> happens to be upgraded or not.
As Nicholas is working on pp_pack currently, your patches will need
adaptations. Moreover it touches Encode, which is supposed to work on
5.8.x perls.
Further comments below.
> Things to note:
> - The trick of using unpack("C*", $string) to see "through" the encoding
> doesn't work anymore
> (which is as it should be if unpack is encoding-neutral)
> You can run unpack under "use bytes" to get the old effect though.
What about forcing it with C0?
> - The meaning of U0 and C0 were sort of swapped for unpack. Consider:
> with current perl:
> perl -wle 'print for unpack("U0U*", "\341\277\274")'
> 8188
> perl -wle 'print for unpack("C0U*", "\341\277\274")'
> 225
> 191
> 188
I don't agree with this. I'm not a heavy pack user, but to me the
current way seems more natural. The semantics you propose for unpack
below look weird (others may care to comment):
> Since the pack behaviour looks right and is the one documented, I
> conclude that the semantics for C0 and U0 in unpack are wrong (reversed).
> With the patch they behave like what I think is right:
>
> ./perl -Ilib -wle 'print for unpack("C0U*", "\341\277\274")'
> 8188
> ./perl -Ilib -wle 'print for unpack("U0U*", "\341\277\274")'
> 225
> 191
> 188
>
> Funny enough I only had to change one test for this
> (perl-dev/t/uni/case.pl), which is not even a core pack test.
Lack of testing: not good. Either way, this would need to be
patched.
> - For the moment I made unpack "C" on a char >= 256 be an error, though
> later on I'd like to make it just basically do ord(). C is however
> the most likely format that users will notice the changed semantics, so
> for a first version it might be good to keep it like this so users can
> flush out errors. (since A, a and Z checksumming is currently delegated
> to "C", that also implies that these won't checksum chars >= 256)
> - Fixed a minor bug in that checksumming over a partial byte in 'B' formats
> forgot to advance the pointer (needs to be applied to maint too)
Submit it separately.
> - U0 or starting with U on an octet string is completely yucky to implement
> as a special case in all formats, so I do these by upgrading a temporary
> string. This is inefficient if you only unpack few things from a huge
> string (fortunately this case shouldn't be normal. pack with an initial
> U would return an already upgraded string, so this would only happen if
> the string later got somehow degraded). Still might be worth warning about
> in the docs.
OK.
> - If the first character in the unpackstring is U it now behaves like
> there is a U0 before it (this is consistent with the documentation
> and behaviour of pack, and needed for reversibility)
This seems to contradict the following :
> - Only literal C0 and U0 cause a mode switch, not ones implied by
> something like: unpack("C/U", "\x00")
> (I consider that a bugfix that should also be applied to maint)
Sounds likely; submit it separately.
> - C0 and U0 modes are scoped to (), so in format "s(nU0v)2S", the U0 mode
> only applies to the v, NOT to the S or the n (in the second round)
> (I also consider the old behaviour here a bug. It made multiround
> groups and C/() style groups too unpredictable)
Right. I don't know whether it is feasible to submit this separately,
though.
> - a,A,Z now conserve unicodeness when extracting from an unicode packed
> string (can't happen (currently) if the string was constructed with
> pack).
> e.g.:
>
> ./perl -Ilib -wle 'use Devel::Peek; Dump(unpack("a*", pack("U*", 8188)))'
> SV = PV(0x8162ef8) at 0x8162b00
> REFCNT = 1
> FLAGS = (TEMP,POK,pPOK,UTF8)
> PV = 0x8167bf0 "\341\277\274"\0 [UTF8 "\x{1ffc}"]
> CUR = 3
> LEN = 4
>
> It could be argued that I should try to do sv_utf8_downgrade(sv, 1)
> before returning, but I prefer the current code. It allows proper
> fixed format parsing of unicode (and byte) strings.
How so? Why has it changed? Is it a side effect of the changes to U formats?
> - "A" stripped both NULL and any whitespace. I change it to NULL and space
> (documented as such, makes it more reversible from pack and I didn't
> want to have to deal with the extended unicode definition of what a
> space is)
> - Relatively few tests and code in perl itself need to be updated.
> The ugliest one is in lib/CGI/Util.pm (what Util.pm is doing there
> is utterly broken anyways, I left it for compatibility. However, what
> it really should do is utf8::downgrade or use %u escapes I think)
>
> Probably this has too many semantic changes to apply it to maint
> (though I think that all places where it makes a difference are bugs really),
> but I think it would be proper for blead.
--
You probably wouldn't have expected a communist to have a dog named Harpo.
-- Malcolm Lowry, Under the Volcano