FYI for those using utf8

Tim.

----- Forwarded message from Slaven Rezic <[EMAIL PROTECTED]> -----

Delivered-To: [EMAIL PROTECTED]
Mailing-List: contact [EMAIL PROTECTED]; run by ezmlm
X-List-Archive: <http://nntp.perl.org/group/perl.perl5.porters/87191>
Delivered-To: mailing list [EMAIL PROTECTED]
Delivered-To: [EMAIL PROTECTED]
X-Authentication-Warning: vran.herceg.de: eserte set sender to [EMAIL PROTECTED] using -f
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Subject: Re: [perl #24846] [PATCH] Apparent utf8 bug in join() in 5.8.[012]
Reply-To: [EMAIL PROTECTED]
From: Slaven Rezic <[EMAIL PROTECTED]>
Date: 11 Jan 2004 22:37:42 +0100
In-Reply-To: <[EMAIL PROTECTED]>

Jesse Vincent (via RT) <[EMAIL PROTECTED]> writes:

> # New Ticket Created by  Jesse Vincent 
> # Please include the string:  [perl #24846]
> # in the subject line of all future correspondence about this issue. 
> # <URL: http://rt.perl.org/rt3/Ticket/Display.html?id=24846 >
> 
> 
> 
> Yeah, I'm still not quite sure I believe it myself, but IO::Scalar
> exercises join with UTF8 and non-UTF8 data causing RT to end up with
> corrupted attachments fairly often. After patching IO::Scalar to work
> around this by emulating join using concatenation, the issue disappears.
> 
> 
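
The workaround mentioned above (emulating join with concatenation) could look roughly like the sketch below. This is not IO::Scalar's actual code: concat_join is a hypothetical helper name, and the sketch assumes that plain assignment and .= take the UTF8 flag from their operands rather than from whatever the target scalar held before.

```perl
use strict;
use warnings;
use Encode qw(is_utf8);

# Hypothetical helper (not taken from IO::Scalar): builds the result with
# assignment and .= instead of calling join().
sub concat_join {
    my ($sep, @parts) = @_;
    return '' unless @parts;
    my $out = shift @parts;        # assignment copies the value and its flag
    $out .= $sep . $_ for @parts;  # append the separator and each remaining part
    return $out;
}

my $bytes = "abc\304";             # byte string, UTF8 flag off
my $res   = concat_join("", $bytes);
print is_utf8($res) ? "utf8\n" : "bytes\n";
```

For ordinary input it behaves like join: concat_join(":", 1, 2, 3) gives "1:2:3".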

Here's a test case:

use strict;
use Encode qw(is_utf8);
use Test::More qw(no_plan);
my $ascii = "abc\304";
my $utf8  = "abc\x{0100}";
for ($utf8, $ascii) {
    my $res = join("", $_);
    is(is_utf8($res), $_ eq $utf8);
}
__END__

Regards,
        Slaven

-- 
Slaven Rezic - [EMAIL PROTECTED]

    tksm - Perl/Tk program for searching and replacing in multiple files
    http://ptktools.sourceforge.net/#tksm


----- End forwarded message -----
----- Forwarded message from SADAHIRO Tomoyuki <[EMAIL PROTECTED]> -----

Delivered-To: [EMAIL PROTECTED]
Mailing-List: contact [EMAIL PROTECTED]; run by ezmlm
X-List-Archive: <http://nntp.perl.org/group/perl.perl5.porters/87211>
Delivered-To: mailing list [EMAIL PROTECTED]
Delivered-To: [EMAIL PROTECTED]
Date: Mon, 12 Jan 2004 11:19:37 +0900
From: SADAHIRO Tomoyuki <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Subject: Re: [perl #24846] [PATCH] Apparent utf8 bug in join() in 5.8.[012] 
In-Reply-To: <[EMAIL PROTECTED]>


> The problem occurs inside join(). join() recycles the string objects
> into which it does the joining and which it later returns. It never
> touches the UTF8 flag on these strings. So, on the initial run, it has
> no strings to recycle (or few), and when they are created they are set
> to ASCII. So all the results of join() are ASCII, which is what MIME and
> RT want, as ASCII is also what is used for processing binary data. The
> problem is that, on the second and subsequent executions of RT within the
> perl system, the recycled strings often have the UTF8 flag set. So
> join('', $string), where $string is ASCII, will often return a UTF8
> string. When this UTF8 string is later converted into ASCII it is
> modified, and so the binary data is corrupted.
>
> The solution is to apply the following patch to perl (tested with
> perl 5.8.2), which sets the UTF8 flag on the returned string to
> something sensible.

This is perhaps due to SvPOK_only_UTF8() in sv_setpv(),
which leaves the UTF8 flag as it was.
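
One way to watch for a leftover flag from Perl space is to push byte and character strings alternately through the same join op and compare the flag on each result with the flag on its input. The sketch below is only illustrative: joined is a hypothetical wrapper, loosely modeled on join_into_my_variable from the test in the patch, and it assumes each call reuses the same target scalar.

```perl
use strict;
use warnings;
use Encode qw(is_utf8);

# Hypothetical wrapper: every call goes through the same join op.
sub joined { return join("", @_) }

my $bytes = "abc\304";        # byte string, UTF8 flag off
my $chars = "abc\x{0100}";    # character string, UTF8 flag on

# Alternate the inputs so that a flag left over from the previous call
# would show up on the next byte-string result.
for my $in ($bytes, $chars, $bytes, $chars) {
    my $out  = joined($in);
    my $same = !!is_utf8($out) eq !!is_utf8($in);
    printf "%-5s input: %s\n",
        (is_utf8($in) ? "utf8" : "bytes"),
        ($same ? "flag preserved" : "flag changed");
}
```

On a perl with the fix applied, every line should report the flag as preserved.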

I disagree with warning when UTF8 and ASCII are mixed.
I think it would upset encoding.pm,
which allows byte strings in an arbitrary encoding
other than the system-native encoding (ASCII/Latin1 or EBCDIC).

### \A patch against perl-5.8.3 RC1
diff -urN perl~/doop.c perl/doop.c
--- perl~/doop.c        Fri Dec 19 05:47:58 2003
+++ perl/doop.c Mon Jan 12 10:08:10 2004
@@ -668,6 +668,10 @@
     }
 
     sv_setpv(sv, "");
+    /* sv_setpv retains old UTF8ness [perl #24846] */
+    if (SvUTF8(sv))
+       SvUTF8_off(sv);
+
     if (PL_tainting && SvMAGICAL(sv))
        SvTAINTED_off(sv);
 
diff -urN perl~/t/op/join.t perl/t/op/join.t
--- perl~/t/op/join.t   Sat Dec 30 16:16:18 2000
+++ perl/t/op/join.t    Mon Jan 12 10:34:22 2004
@@ -1,6 +1,6 @@
 #!./perl
 
-print "1..14\n";
+print "1..18\n";
 
 @x = (1, 2, 3);
 if (join(':',@x) eq '1:2:3') {print "ok 1\n";} else {print "not ok 1\n";}
@@ -65,3 +65,29 @@
   print "ok 14\n";
 }
 
+{ # [perl #24846] $jb2 should be in bytes, not in utf8.
+  my $b = "abc\304";
+  my $u = "abc\x{0100}";
+
+  sub join_into_my_variable {
+    my $r = join("", @_);
+    return $r;
+  }
+
+  my $jb1 = join_into_my_variable("", $b);
+  my $ju1 = join_into_my_variable("", $u);
+  my $jb2 = join_into_my_variable("", $b);
+  my $ju2 = join_into_my_variable("", $u);
+
+  print "not " unless unpack('H*', $jb1) eq unpack('H*', $b);
+  print "ok 15\n";
+
+  print "not " unless unpack('H*', $ju1) eq unpack('H*', $u);
+  print "ok 16\n";
+
+  print "not " unless unpack('H*', $jb2) eq unpack('H*', $b);
+  print "ok 17\n";
+
+  print "not " unless unpack('H*', $ju2) eq unpack('H*', $u);
+  print "ok 18\n";
+}
### \z patch

Regards
SADAHIRO Tomoyuki


----- End forwarded message -----
----- Forwarded message from Rafael Garcia-Suarez <[EMAIL PROTECTED]> -----

Delivered-To: [EMAIL PROTECTED]
Mailing-List: contact [EMAIL PROTECTED]; run by ezmlm
X-List-Archive: <http://nntp.perl.org/group/perl.perl5.porters/87218>
Delivered-To: mailing list [EMAIL PROTECTED]
Delivered-To: [EMAIL PROTECTED]
Date: Mon, 12 Jan 2004 11:27:53 +0100
From: Rafael Garcia-Suarez <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Subject: Re: [perl #24846] [PATCH] Apparent utf8 bug in join() in 5.8.[012]
In-Reply-To: <[EMAIL PROTECTED]>

SADAHIRO Tomoyuki wrote:
> I disagree with warning when UTF8 and ASCII are mixed.

So do I.

> I think it would upset encoding.pm,
> which allows byte strings in an arbitrary encoding
> other than the system-native encoding (ASCII/Latin1 or EBCDIC).
> 
> ### \A patch against perl-5.8.3 RC1
> diff -urN perl~/doop.c perl/doop.c
> --- perl~/doop.c      Fri Dec 19 05:47:58 2003
> +++ perl/doop.c       Mon Jan 12 10:08:10 2004

Thanks, applied to bleadperl as #22117.

----- End forwarded message -----
