Thanks very much for your further information about this issue.
I'll be happy to file a bug report, but I should also mention that the
problematic behavior occurs with both "use encoding 'utf8'" and "use utf8",
and differs between them. Each produces wrong results, but the wrong results
are not the same:
With “use encoding 'utf8'”:
The NBS is matched by /\N{NO-BREAK SPACE}/
The NBS is matched by / / (a no-break space)
The NBS is matched by /[\7f-\x80]/
The NBS is NOT matched by /[\xa0]/
The NBS is NOT matched by /\xa0/
The NBS is NOT matched by /\N{U+00a0}/
The NBS is NOT matched by /junk/
With “use utf8”:
The NBS is matched by /\N{NO-BREAK SPACE}/
The NBS is matched by / / (a no-break space)
The NBS is matched by /[\7f-\x80]/
The NBS is matched by /[\xa0]/
The NBS is matched by /\xa0/
The NBS is matched by /\N{U+00a0}/
The NBS is NOT matched by /junk/
With neither “use encoding 'utf8'” nor “use utf8”:
The NBS is matched by /\N{NO-BREAK SPACE}/
The NBS is NOT matched by /Â / (a no-break space)
The NBS is matched by /[\7f-\x80]/
The NBS is matched by /[\xa0]/
The NBS is matched by /\xa0/
The NBS is matched by /\N{U+00a0}/
The NBS is NOT matched by /junk/
(The 3rd and 7th patterns, out of 7, should fail.)
(If I include both statements, the behavior is the same as if "use encoding
'utf8'" alone is present. This testing is with "<:encoding(utf8)".)
So, I'm confused as to whether this is 1 bug or more than 1, and how best to
document it (or them). Could you advise me on this?
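For anyone wanting to reproduce listings like those above without depending on how the script file itself is encoded, here is a minimal sketch. It is not the attached test script: the names $line, @tests and %matched are illustrative, the input line is simulated by decoding the bytes C2 A0 rather than read from a file, and the literal no-break-space pattern is left out so that the source stays pure ASCII.

```perl
use strict;
use warnings;
use Encode qw(decode);
use charnames ':full';

# Simulate one line read through '<:encoding(utf8)': the bytes C2 A0
# decode to the single character U+00A0 (NO-BREAK SPACE).
my $line = decode('UTF-8', "a\xC2\xA0b");

# Patterns written with ASCII-only escapes, so the outcome does not
# depend on "use utf8" or "use encoding" being in effect.
my @tests = (
    [ '\N{NO-BREAK SPACE}', qr/\N{NO-BREAK SPACE}/ ],
    [ '[\x7f-\x80]',        qr/[\x7f-\x80]/        ],
    [ '[\xa0]',             qr/[\xa0]/             ],
    [ '\xa0',               qr/\xa0/               ],
    [ '\N{U+00a0}',         qr/\N{U+00a0}/         ],
    [ 'junk',               qr/junk/               ],
);

# Record and report whether each pattern matches the decoded line.
my %matched;
for my $t (@tests) {
    my ($label, $re) = @$t;
    $matched{$label} = $line =~ $re ? 1 : 0;
    printf "The NBS is %s by /%s/\n",
        $matched{$label} ? 'matched' : 'NOT matched', $label;
}
```

In this sketch the range pattern and /junk/ are the ones that should fail, and the other four should match; a different result would point at the interpreter rather than at the source encoding.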
On 30 Nov 2010, at 10:25, karl williamson wrote:
> Jonathan Pool wrote:
>> Let's say the character NO-BREAK SPACE (U+00A0) appears in a UTF8-encoded
>> text file (so it appears there as C2A0), and I want to match strings that
>> contain this character.
>> I write a script (itself encoded with UTF8) in Perl 5.10.0 (on OS X 10.6.5)
>> with:
>> use encoding 'utf8';
>> use charnames ':full';
>> The script opens the file with:
>> open FH, '<:utf8', 'filename.txt';
>
> You should always use '<:encoding(utf8)' instead to get utf8 validation.
> But that's not the problem here.
> I tested it on the very latest development code, and it still fails. The
> problem is a bug or bugs in Perl with parsing files encoded in utf8. I
> converted the .pl to latin1 and removed the "use encoding 'utf8'", and it
> works.
>
> I believe it is known that there are issues with 'use encoding', but I
> suggest filing a bug report, by sending email to [email protected]. Attached
> are two files I created to test. These should be attached to the bug report
> so as to not have to be done again.
>> It reads lines in with:
>> while (<FH>) {}
>> Then, in a regular expression in the script, I can match the NO-BREAK SPACE
>> with any of these patterns:
>> 1. /\N{NO-BREAK SPACE}/
>> 2. / / (where the character between slashes looks like a space but is a
>> no-break space)
>> 3. /[\x7f-\x80]/
>> Patterns 1 and 2 make sense, but pattern 3 is mysterious to me, because the
>> range specified in pattern 3 includes DELETE and an unnamed character but
>> does not include NO-BREAK SPACE.
>> Moreover, I expect to be able to match the NO-BREAK SPACE with these
>> patterns, but I cannot:
>> 4. /[\xa0]/
>> 5. /\xa0/
>> In the related documentation, I have not found anything explaining why
>> pattern 3 works, or anything explaining why patterns 4 and 5 do not work.
>> I have replicated these anomalies in Perl 5.8.8 under Red Hat Enterprise
>> Linux 5.
>> I would be delighted to receive explanations or references to documentation
>> that I have overlooked or misunderstood.
> <nobreak_latin1.pl><nobreak_utf8.pl>