Re: Encoding neutral unpack

Ton Hospel Sat, 19 Mar 2005 14:14:03 -0800

In article <[EMAIL PROTECTED]>,
        Rafael Garcia-Suarez <[EMAIL PROTECTED]> writes:
> I applied the patch below to perldelta. Comments welcome.
>
> Change 24035 by [EMAIL PROTECTED] on 2005/03/13 21:14:36
>
>         Document pack changes in perldelta
>
> Affected files ...
>
> ... //depot/perl/pod/perl592delta.pod#3 edit
>
> Differences ...
>
> ==== //depot/perl/pod/perl592delta.pod#3 (text) ====
>
> @@ -10,6 +10,34 @@
>
>  =head1 Incompatible Changes
>
> +=head2 Packing and UTF-8 strings
> +
> +The semantics of pack() and unpack() regarding UTF-8-encoded data has been
> +clarified. B<The character mode is now the default.> Notably, code that
> +uses C<pack("a*", $string)> to see through the encoding of string will now
> +simply return $string.


Maybe:
The semantics of pack() and unpack() regarding UTF-8-encoded data have been
changed. Processing is now by default character by character instead of
byte by byte on the underlying encoding. Notably, code that used things
like C<pack("a*", $string)> to see through the encoding of a string will now
simply get back the original $string. Packed strings can also get upgraded
during processing when you store upgraded characters. You can get the old
behaviour by using "use bytes".
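
To make the "a*" case concrete, here is a minimal sketch, assuming a perl
where this change has landed (5.10 or later). The variable names are only
for illustration and do not come from the patch:

```perl
use strict;
use warnings;

# A string with a character above 255, so perl stores it upgraded (UTF-8 internally).
my $string = "caf\x{e9} \x{263A}";

# In character mode "a*" copies characters, not the bytes of the internal encoding.
my $packed = pack("a*", $string);

print $packed eq $string ? "same\n" : "different\n";
print "length: ", length($packed), "\n";
```

Under the old byte-per-byte behaviour the packed result would have been the
longer string of UTF-8 bytes rather than the original characters.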

> +
> +To be consistent with pack(), the C<C0> in unpack() templates indicates
> +that the data is to be processed in character mode, i.e. character by
> +character; at the contrary, C<U0> in unpack() indicates UTF-8 mode, where
> +the packed string is processed in its UTF-8-encoded Unicode form on a byte
> +by byte basis. This is reversed with regard to perl 5.8.X.
> +
> +Moreover, C<C0> and C<U0> can also be used in pack() templates to specify
> +respectively character and byte modes.
> +
> +C<C0> and C<U0> in the middle of a pack format now switch to the specified
> +encoding mode, honoring parens grouping. Previously, parens were ignored.

C<C0> and C<U0> in the middle of a pack or unpack format now switch to the
specified encoding mode, honoring parens grouping. Previously, parens were
ignored.

> +
> +Also, there is a new pack() character format, C<W>, which is intended to
> +replace the old C<C>. C<C> is kept for unsigned chars coded on eight bits.
> +C<W> represents unsigned character values, which can be greater than 255.
C<C> is kept for unsigned chars coded as bytes in the string's internal
representation. C<W> represents unsigned (logical) character values, which can
be greater than 255.

> +It is therefore more robust when dealing with potentially UTF-8-encoded
> +data (as C<C> will wrap values outside the range 0..255).
(as C<C> will wrap values outside the range 0..255 and not respect the string
encoding).
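
A short sketch of the C<W> versus C<C> difference, again assuming a perl with
these changes (5.10 or later); the value 300 is just an arbitrary example
above 255:

```perl
use strict;
use warnings;

# W stores the logical character value, so values above 255 survive.
my $w = pack("W", 300);
printf "W: length %d, ord %d\n", length($w), ord($w);

# C keeps only the low eight bits, so 300 wraps to 300 % 256 == 44.
# perl also warns that the value wrapped; the warning is silenced for this sketch.
my $c = do { no warnings 'pack'; pack("C", 300) };
printf "C: length %d, ord %d\n", length($c), ord($c);
```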
> +
> +In practice, that means that pack formats are now encoding-neutral, except
> +C<C>.
> +
>  =head1 Core Enhancements
>
>  =head1 Modules and Pragmata
