FYI for those using utf8.

Tim.

----- Forwarded message from Slaven Rezic <[EMAIL PROTECTED]> -----

Delivered-To: [EMAIL PROTECTED]
Mailing-List: contact [EMAIL PROTECTED]; run by ezmlm
X-List-Archive: <http://nntp.perl.org/group/perl.perl5.porters/87191>
Delivered-To: mailing list [EMAIL PROTECTED]
Delivered-To: [EMAIL PROTECTED]
X-Authentication-Warning: vran.herceg.de: eserte set sender to [EMAIL PROTECTED] using -f
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Subject: Re: [perl #24846] [PATCH] Apparent utf8 bug in join() in 5.8.[012]
Reply-To: [EMAIL PROTECTED]
From: Slaven Rezic <[EMAIL PROTECTED]>
Date: 11 Jan 2004 22:37:42 +0100
In-Reply-To: <[EMAIL PROTECTED]>

Jesse Vincent (via RT) <[EMAIL PROTECTED]> writes:

> # New Ticket Created by Jesse Vincent
> # Please include the string: [perl #24846]
> # in the subject line of all future correspondence about this issue.
> # <URL: http://rt.perl.org/rt3/Ticket/Display.html?id=24846 >
>
>
> Yeah, I'm still not quite sure I believe it myself, but IO::Scalar
> exercises join with UTF8 and non-UTF8 data causing RT to end up with
> corrupted attachments fairly often. After patching IO::Scalar to work
> around this by emulating join using concatenation, the issue disappears.
>
>

Here's a test case:

use strict;
use Encode qw(is_utf8);
use Test::More qw(no_plan);

my $ascii = "abc\304";
my $utf8 = "abc\x{0100}";
for ($utf8, $ascii) {
    my $res = join("", $_);
    is(is_utf8($res), $_ eq $utf8);
}
__END__

Regards,
    Slaven

-- 
Slaven Rezic - [EMAIL PROTECTED]

    tksm - Perl/Tk program for searching and replacing in multiple files
    http://ptktools.sourceforge.net/#tksm

----- End forwarded message -----

----- Forwarded message from SADAHIRO Tomoyuki <[EMAIL PROTECTED]> -----

Delivered-To: [EMAIL PROTECTED]
Mailing-List: contact [EMAIL PROTECTED]; run by ezmlm
X-List-Archive: <http://nntp.perl.org/group/perl.perl5.porters/87211>
Delivered-To: mailing list [EMAIL PROTECTED]
Delivered-To: [EMAIL PROTECTED]
Date: Mon, 12 Jan 2004 11:19:37 +0900
From: SADAHIRO Tomoyuki <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Subject: Re: [perl #24846] [PATCH] Apparent utf8 bug in join() in 5.8.[012]
In-Reply-To: <[EMAIL PROTECTED]>

> The problem occurs inside join(). join() recycles string objects
> into which it does the joining, which it later returns. It never
> touches the UTF8 flag on these strings. So, on the initial run, it has
> no strings to recycle (or few), and when they are created they are set
> to ASCII. So all the results of join() are ASCII, which is what MIME and
> RT wants, as ASCII is also what is used for processing binary data. The
> problem is, on the second and subsequent executions of RT within the perl
> system, the recycled strings often have the UTF8 flag set. So, join ('',
> $string), where $string is ASCII, will often return a UTF8 string. When
> this UTF8 string is later converted into ASCII it is modified, and so
> the binary data is corrupted.
>
> The solution is to apply the following patch to perl (tested with
> perl 5.8.2), which sets the UTF8 flag on the returned string to
> something sensible.

This is perhaps due to SvPOK_only_UTF8() in sv_setpv()
which leaves UTF8 flag as it was.

I disagree warning when UTF8 and ASCII are mixed.
I think it would upset encoding.pm
which allows byte strings as in arbitrary encoding
other than the system-native encoding (ASCII/Latin1 or EBCDIC).

### \A patch against perl-5.8.3 RC1
diff -urN perl~/doop.c perl/doop.c
--- perl~/doop.c Fri Dec 19 05:47:58 2003
+++ perl/doop.c Mon Jan 12 10:08:10 2004
@@ -668,6 +668,10 @@
     }
 
     sv_setpv(sv, "");
+    /* sv_setpv retains old UTF8ness [perl #24846] */
+    if (SvUTF8(sv))
+        SvUTF8_off(sv);
+
     if (PL_tainting && SvMAGICAL(sv))
         SvTAINTED_off(sv);
 
diff -urN perl~/t/op/join.t perl/t/op/join.t
--- perl~/t/op/join.t Sat Dec 30 16:16:18 2000
+++ perl/t/op/join.t Mon Jan 12 10:34:22 2004
@@ -1,6 +1,6 @@
 #!./perl
 
-print "1..14\n";
+print "1..18\n";
 
 @x = (1, 2, 3);
 if (join(':',@x) eq '1:2:3') {print "ok 1\n";} else {print "not ok 1\n";}
@@ -65,3 +65,29 @@
   print "ok 14\n";
 }
 
+{ # [perl #24846] $jb2 should be in bytes, not in utf8.
+    my $b = "abc\304";
+    my $u = "abc\x{0100}";
+
+    sub join_into_my_variable {
+        my $r = join("", @_);
+        return $r;
+    }
+
+    my $jb1 = join_into_my_variable("", $b);
+    my $ju1 = join_into_my_variable("", $u);
+    my $jb2 = join_into_my_variable("", $b);
+    my $ju2 = join_into_my_variable("", $u);
+
+    print "not " unless unpack('H*', $jb1) eq unpack('H*', $b);
+    print "ok 15\n";
+
+    print "not " unless unpack('H*', $ju1) eq unpack('H*', $u);
+    print "ok 16\n";
+
+    print "not " unless unpack('H*', $jb2) eq unpack('H*', $b);
+    print "ok 17\n";
+
+    print "not " unless unpack('H*', $ju2) eq unpack('H*', $u);
+    print "ok 18\n";
+}
### \z patch

Regards
SADAHIRO Tomoyuki

----- End forwarded message -----

----- Forwarded message from Rafael Garcia-Suarez <[EMAIL PROTECTED]> -----

Delivered-To: [EMAIL PROTECTED]
Mailing-List: contact [EMAIL PROTECTED]; run by ezmlm
X-List-Archive: <http://nntp.perl.org/group/perl.perl5.porters/87218>
Delivered-To: mailing list [EMAIL PROTECTED]
Delivered-To: [EMAIL PROTECTED]
Date: Mon, 12 Jan 2004 11:27:53 +0100
From: Rafael Garcia-Suarez <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Subject: Re: [perl #24846] [PATCH] Apparent utf8 bug in join() in 5.8.[012]
In-Reply-To: <[EMAIL PROTECTED]>

SADAHIRO Tomoyuki wrote:
> I disagree warning when UTF8 and ASCII are mixed.

So do I.

> I think it would upset encoding.pm
> which allows byte strings as in arbitrary encoding
> other than the system-native encoding (ASCII/Latin1 or EBCDIC).
>
> ### \A patch against perl-5.8.3 RC1
> diff -urN perl~/doop.c perl/doop.c
> --- perl~/doop.c Fri Dec 19 05:47:58 2003
> +++ perl/doop.c Mon Jan 12 10:08:10 2004

Thanks, applied to bleadperl as #22117.

----- End forwarded message -----
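
For anyone stuck on an unpatched 5.8.[012], the workaround Jesse mentions (emulating join using concatenation) could look roughly like the sketch below. The name concat_join is hypothetical and this is not the actual IO::Scalar change; it only illustrates the idea of building the result in a fresh lexical, so that its UTF8 flag depends only on the separator and the pieces appended. It assumes utf8::is_utf8 is available, which should be the case from 5.8.1 on.

```perl
use strict;
use warnings;

# Hypothetical helper, not taken from IO::Scalar or the patch above:
# builds the joined string by concatenation into a fresh lexical
# instead of calling join().
sub concat_join {
    my ($sep, @parts) = @_;
    return '' unless @parts;
    my $out = '';
    $out .= shift @parts;
    $out .= $sep . $_ for @parts;
    return $out;
}

# Alternate byte and character strings, as in the join.t test above,
# and report the UTF8 flag of each input and result.
for my $s ("abc\304", "abc\x{0100}", "abc\304") {
    my $res = concat_join('', $s);
    printf "input flag=%d result flag=%d\n",
        utf8::is_utf8($s)   ? 1 : 0,
        utf8::is_utf8($res) ? 1 : 0;
}
```

The result flag should follow the input flag on every iteration, including the byte string that comes after the character string.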