Re: character sets in a regexp

Charles DeRykus Sat, 12 Jan 2013 12:57:36 -0800

On Fri, Jan 11, 2013 at 2:01 PM, Christer Palm <b...@bredband.net> wrote:
> Hi!
>
> I have a perl script that parses RSS streams from different news sources and 
> experience problems with national characters in a regexp function used for 
> matching a keyword list with the RSS data.
>
> Everything works fine with a simple regexp for plain english i.e. words 
> containing the letters A-Z, a-z, 0-9.
>
> if ( $description =~ m/\b$key/i ) {….}
>
> Keywords or RSS data with national characters don’t work at all. I’m not 
> really surprised; this was expected, as the character sets used in the 
> different RSS streams are outside my control.
>
> I have ”use utf8;” enabled, but I’m not really sure if it is needed. I 
> can’t see any difference with or without it.
>
> If I convert all the national characters in the keyword list to HTML 
> entities such as ”&aring”, and use a character parser to change every 
> encoded character in the RSS data (octal, or Unicode in decimal or hex 
> form) to HTML entities as well, everything works fine, but it takes time 
> that I want to avoid.
>
> Do you have suggestions on this character issue? Is it possible to determine 
> the character set of a text efficiently? Are there other ways to solve the 
> problem?
>
> /Christer
> --
> To unsubscribe, e-mail: beginners-unsubscr...@perl.org
> For additional commands, e-mail: beginners-h...@perl.org
> http://learn.perl.org/
>

I'm not sure if this is related but the docs mention some character
and byte semantics overlap.

*** START perlunicode:
..As discussed elsewhere, Perl has one foot (two hooves?) planted in
each of two worlds: the old world of bytes and the new world of
characters, upgrading from bytes to characters when necessary. If your
legacy code does not explicitly use Unicode, no automatic switch-over
to characters should happen. Characters shouldn't get downgraded to
bytes, either. It is possible to accidentally mix bytes and
characters, however (see perluniintro), in which case \w in regular
expressions might start behaving differently (unless the /a modifier
is in effect). Review your code. Use warnings and the strict pragma.
*** END perlunicode
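
As a minimal sketch of keeping both sides in the character world (my reading of what the RSS case needs, not something the quoted docs spell out): decode the raw feed bytes with Encode before matching, so the text and the keyword are both character strings. The helper name matches_keyword, the sample strings, and the charset labels are hypothetical, and the /u modifier assumes Perl 5.14 or later. Note that "use utf8;" only declares that the script's own source is UTF-8; it does not decode data read from a feed.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode raw bytes using the given charset, then
# look for the keyword at a word boundary using Unicode rules (/u).
sub matches_keyword {
    my ($raw_bytes, $charset, $key) = @_;
    my $text = decode($charset, $raw_bytes);    # bytes -> characters
    return $text =~ /\b\Q$key\E/iu ? 1 : 0;
}

binmode(STDOUT, ":encoding(UTF-8)");

# The same (made-up) headline as it might arrive in two encodings.
my $utf8_bytes   = "Nyheter fr\xc3\xa5n \xc3\x85land";    # UTF-8
my $latin1_bytes = "Nyheter fr\xe5n \xc5land";            # ISO-8859-1

my $key = "\x{e5}land";    # keyword as characters: U+00E5 followed by "land"

print "UTF-8 feed:   ", matches_keyword($utf8_bytes,   'UTF-8',      $key), "\n";
print "Latin-1 feed: ", matches_keyword($latin1_bytes, 'ISO-8859-1', $key), "\n";
```

In a real script the charset would presumably come from the HTTP header or the XML declaration of each feed rather than being hard-coded as it is here.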

<speculate>
Perhaps, although the docs do not say so explicitly, this downgrading
might affect \b as well as \w. Here's an example which appears to
support this, since adding \b causes the match to fail. (There may be a
workaround via the character properties mentioned in perlunicode.)
</speculate>

#!/usr/bin/perl
use strict;
use warnings;

binmode(STDOUT, ":utf8");
my $cosa = "my \x{263a}";    # declared with "my" so it compiles under strict
print "cosa=$cosa\n";

# Same match attempted with and without a leading \b.
print "found smiley at \\b\n" if $cosa =~ /\b\x{263a}/;
print "found smiley (no \\b)\n" if $cosa =~ /\x{263a}/;

The output:
cosa=my ☺
found smiley (no \b)
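
Another possible reading of that result, offered as a sketch rather than a diagnosis: \b only matches between a \w character and a non-\w character. U+263A is a symbol rather than a word character, and here it follows a space, so there is no word boundary in front of it whichever semantics apply. The snippet below compares it with a letter (U+00E5) and also tries a lookbehind built from character properties as one form the workaround could take. The helper names are hypothetical, and /u assumes Perl 5.14 or later.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helpers, each returning 1 or 0.
sub is_word_char {
    my ($char) = @_;
    return $char =~ /^\w$/u ? 1 : 0;
}

sub boundary_before {
    my ($char) = @_;
    my $text = "my $char";
    return $text =~ /\b\Q$char\E/u ? 1 : 0;
}

# Alternative to \b: require that no letter, digit or underscore precedes.
sub property_boundary_before {
    my ($char) = @_;
    my $text = "my $char";
    return $text =~ /(?<![\p{L}\p{N}_])\Q$char\E/ ? 1 : 0;
}

binmode(STDOUT, ":encoding(UTF-8)");
for my $char ("\x{263a}", "\x{e5}", "x") {
    printf "U+%04X word=%d \\b=%d property=%d\n",
        ord($char),
        is_word_char($char),
        boundary_before($char),
        property_boundary_before($char);
}
```

If this reading is right, a keyword that starts with a letter such as U+00E5 should still match after \b once the data has been decoded to characters.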

-- 
Charles DeRykus
