On Mon, Aug 08, 2005 at 04:01:05PM -0700, Christopher J. Madsen wrote:
> # New Ticket Created by "Christopher J. Madsen"
> # Please include the string: [perl #36839]
> # in the subject line of all future correspondence about this issue.
> # <URL: https://rt.perl.org/rt3/Ticket/Display.html?id=36839 >
>
> This is a bug report for perl from [EMAIL PROTECTED],
> generated with the help of perlbug 1.35 running under perl v5.8.6.
>
> -----------------------------------------------------------------
> [Please enter your report here]
>
> I'm not clear on whether the non-breaking space character (\xA0) is
> supposed to match \s or not, but it certainly shouldn't depend on what
> other characters are in the string. So this test case:

No, I agree that it should not. However, this is *the* unfixable UTF-8 bug
in Perl 5 - the fact that 1 bit is used as a flag that signals both "buffer
is encoded as UTF-8" and "string should use Unicode rather than bytes
semantics" (bytes being ASCII or EBCDIC).

It all comes down to the internal representation that the scalar happens to
have:

$ ./perl -Ilib
$_ = "\xA0";
print "Bytes\n" if /\s/;
utf8::upgrade $_;
print "UTF-8\n" if /\s/;
__END__
UTF-8

[well, it might be fixable long term, but an incomplete list of requirements
would be:
1: finding another flag bit in every SV
2: a deprecation cycle in 5.10 so it wouldn't be done until 5.12]

I think that the best workaround would be to normalise all the data you
process to UTF-8 representation internally. This will make \s consistent.

Nicholas Clark
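
A minimal sketch of the workaround suggested above, assuming it is acceptable to upgrade every string at the point where it enters the program. The helper name normalise() is hypothetical and not part of any module; utf8::upgrade is the core function already used in the test case. It changes only the internal representation, so the characters in the string compare equal before and after.

```perl
use strict;
use warnings;

# Hypothetical helper: return a copy of the string that is held in the
# UTF-8 internal representation. The characters themselves are unchanged.
sub normalise {
    my ($str) = @_;
    utf8::upgrade($str);
    return $str;
}

my $raw  = "\xA0";            # non-breaking space, byte representation
my $norm = normalise($raw);   # same character, UTF-8 representation

# eq compares characters, not representations
print "same characters\n" if $raw eq $norm;

# With the string upgraded, \s is matched using Unicode semantics
print "normalised string matches \\s\n" if $norm =~ /\s/;
```

Calling such a helper on all input (file reads, arguments, database values) means regex matches no longer depend on which representation a given scalar happened to have.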