Re: Matching upper ASCII characters in RE patterns

karl williamson Tue, 30 Nov 2010 11:10:30 -0800

karl williamson wrote:

Jonathan Pool wrote:
Let's say the character NO-BREAK SPACE (U+00A0) appears in a UTF8-encoded text file (so it appears there as C2A0), and I want to match strings that contain this character.
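For anyone following along, the byte sequence mentioned above can be checked with a minimal sketch using the core Encode module. The variable names here are illustrative only and are not taken from the original script.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Encode the single character U+00A0 (NO-BREAK SPACE) as UTF-8 bytes.
my $bytes = encode('UTF-8', "\x{A0}");

# Show the bytes as uppercase hex.
my $hex = uc unpack('H*', $bytes);
print "U+00A0 as UTF-8 bytes: $hex\n";   # prints "U+00A0 as UTF-8 bytes: C2A0"
```

So one character in the decoded text corresponds to two bytes on disk, which is why the input layer used when opening the file matters.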
I write a script (itself encoded with UTF8) in Perl 5.10.0 (on OS X 10.6.5) with:
use encoding 'utf8';
use charnames ':full';

The script opens the file with:

open FH, '<:utf8', 'filename.txt';
You should always use '<:encoding(utf8)' instead to get utf8 validation.
But that's not the problem here.
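As a rough sketch of the suggested open call, the following writes the bytes C2 A0 to a scratch file and reads them back through '<:encoding(utf8)'. The scratch file from File::Temp stands in for filename.txt, and the variable names are assumptions for illustration. Note that the script deliberately uses neither 'use encoding' nor 'use utf8', and its source is plain ASCII.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a small file containing the raw UTF-8 bytes for U+00A0 (C2 A0).
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
print {$out} "a\xC2\xA0b\n";
close $out or die "close failed: $!";

# Read it back through the decoding layer recommended above.
open my $fh, '<:encoding(utf8)', $path or die "Cannot open $path: $!";
my $line = <$fh>;
close $fh;
chomp $line;

my $char_count = length $line;             # counts characters, not bytes
my $has_nbsp   = $line =~ /\xa0/ ? 1 : 0;  # pattern 5 from the question
print "characters: $char_count, NBSP matched: $has_nbsp\n";
```

The four bytes before the newline decode to three characters, and in this setting the plain /\xa0/ pattern matches the NO-BREAK SPACE.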
I tested it on the very latest development code, and it still fails. The problem is a bug or bugs in Perl with parsing files encoded in utf8. I converted the .pl to latin1 and removed the "use encoding 'utf8'", and it works.
I believe it is known that there are issues with 'use encoding', but I suggest filing a bug report, by sending email to [email protected]. Attached are two files I created to test. These should be attached to the bug report so as to not have to be done again.

I thought about it some more, and replaced the "use encoding 'utf8'" with just "use utf8", and it also works there.
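A minimal sketch of that "use utf8" variant, checking the patterns from the question against a decoded string. This is not the attached test file; the hash name and its keys are made up for illustration, and the input is built with Encode::decode rather than read from a file.

```perl
use strict;
use warnings;
use utf8;                      # source is UTF-8; no 'use encoding'
use charnames ':full';
use Encode qw(decode);

# Decode the bytes C2 A0 (between two ASCII letters) into characters.
my $text = decode('UTF-8', "x\xC2\xA0y");

# Try patterns 1, 3, 4 and 5 from the question.
my %matched = (
    named   => ($text =~ /\N{NO-BREAK SPACE}/ ? 1 : 0),   # pattern 1
    range   => ($text =~ /[\x7f-\x80]/        ? 1 : 0),   # pattern 3
    class   => ($text =~ /[\xa0]/             ? 1 : 0),   # pattern 4
    literal => ($text =~ /\xa0/               ? 1 : 0),   # pattern 5
);
printf "%-8s %s\n", $_, $matched{$_} for sort keys %matched;
```

On a current Perl, patterns 1, 4 and 5 should match here and the range in pattern 3 should not, which is the behavior the question expected.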

It reads lines in with:

while (<FH>) {}
Then, in a regular expression in the script, I can match the NO-BREAK SPACE with any of these patterns:
1. /\N{NO-BREAK SPACE}/
2. / / (where the character between slashes looks like a space but is a no-break space)
3. /[\x7f-\x80]/
Patterns 1 and 2 make sense, but pattern 3 is mysterious to me, because the range specified in pattern 3 includes DELETE and an unnamed character but does not include NO-BREAK SPACE.
Moreover, I expect to be able to match the NO-BREAK SPACE with these patterns, but I cannot:
4. /[\xa0]/

5. /\xa0/
In the related documentation, I have not found anything explaining why pattern 3 works, or anything explaining why patterns 4 and 5 do not work.
I have replicated these anomalies in Perl 5.8.8 under Red Hat Enterprise Linux 5.
I would be delighted to receive explanations or references to documentation that I have overlooked or misunderstood.