Re: [perl #36839] \xA0 (non-breaking space) does and doesn't match \s

Nicholas Clark Tue, 09 Aug 2005 02:38:23 -0700

On Mon, Aug 08, 2005 at 04:01:05PM -0700, Christopher J. Madsen wrote:
> # New Ticket Created by  "Christopher J. Madsen" 
> # Please include the string:  [perl #36839]
> # in the subject line of all future correspondence about this issue. 
> # <URL: https://rt.perl.org/rt3/Ticket/Display.html?id=36839 >
> 
> 
> This is a bug report for perl from [EMAIL PROTECTED],
> generated with the help of perlbug 1.35 running under perl v5.8.6.
> 
> 
> -----------------------------------------------------------------
> [Please enter your report here]
> 
> I'm not clear on whether the non-breaking space character (\xA0) is
> supposed to match \s or not, but it certainly shouldn't depend on what
> other characters are in the string.  So this test case:


No, I agree that it should not.

However, this is *the* unfixable UTF-8 bug in Perl 5: a single flag bit
signals both "buffer is encoded as UTF-8" and "string should use Unicode
rather than bytes semantics" (bytes being ASCII or EBCDIC).

It all comes down to the internal representation that the scalar happens to
have:

$ ./perl -Ilib
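$_ = "\xA0";
print "Bytes\n" if /\s/;   # byte representation: no match, prints nothing
utf8::upgrade $_;          # same character, now stored internally as UTF-8
print "UTF-8\n" if /\s/;   # UTF-8 representation: matches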
__END__
UTF-8
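
The same effect is what makes the result appear to depend on the other
characters in the string. A minimal sketch, assuming default regex semantics
(no "use feature 'unicode_strings'" and no /u modifier); the two strings are
illustrative only and are not the test case from the report:

```perl
use strict;
use warnings;

# Illustrative strings only; NOT the reporter's original test case.
my $plain = "\xA0" . "a";        # every character < 0x100: byte representation
my $wide  = "\xA0" . "\x{100}";  # a character > 0xFF forces UTF-8 representation

# utf8::is_utf8 reports the internal flag discussed above.
my $plain_flag = utf8::is_utf8($plain) ? 1 : 0;
my $wide_flag  = utf8::is_utf8($wide)  ? 1 : 0;

# Same \xA0 character in both strings, same pattern.
my $plain_match = $plain =~ /\s/ ? 1 : 0;
my $wide_match  = $wide  =~ /\s/ ? 1 : 0;

printf "plain: flag=%d match=%d\n", $plain_flag, $plain_match;
printf "wide:  flag=%d match=%d\n", $wide_flag,  $wide_match;
```

The only difference between the two strings, as far as \s is concerned, is
whether the neighbouring character pushed the scalar into the UTF-8
representation.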



[well, it might be fixable long term, but an incomplete list of requirements
would be:

1: finding another flag bit in every SV
2: a deprecation cycle in 5.10

so it wouldn't be done until 5.12]


I think that the best workaround would be to normalise all the data you
process to the UTF-8 representation internally (for example with
utf8::upgrade, as above). This will make \s consistent.
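
A minimal sketch of that workaround, assuming the data arrives as ordinary
Perl strings. The helper name "upgraded" is hypothetical, made up for this
example; utf8::upgrade and utf8::is_utf8 are the standard built-in functions:

```perl
use strict;
use warnings;

# Hypothetical helper: return a copy of the string in the internal UTF-8
# representation. utf8::upgrade changes only the storage format; the
# characters in the string stay the same.
sub upgraded {
    my ($str) = @_;        # copy, so the caller's scalar is left untouched
    utf8::upgrade($str);
    return $str;
}

my $nbsp       = "\xA0";
my $normalised = upgraded($nbsp);

my $same_chars = $normalised eq $nbsp        ? 1 : 0;  # still the same string
my $flag_set   = utf8::is_utf8($normalised)  ? 1 : 0;  # now UTF-8 internally
my $matches    = $normalised =~ /\s/         ? 1 : 0;  # \s now sees Unicode rules

printf "same=%d flag=%d match=%d\n", $same_chars, $flag_set, $matches;
```

Applying something like this at the point where data enters the program means
every later \s match sees the same representation.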

Nicholas Clark
