Re: Regex problem

Joe Quinn Mon, 28 Mar 2016 09:22:09 -0700

On 3/28/2016 11:59 AM, RW wrote:

On Mon, 28 Mar 2016 09:58:23 -0400
Joe Quinn wrote:

On 3/28/2016 9:55 AM, RW wrote:

    Subject =~ /\$\b/

There's no word boundary between the $ and the ' ' because they're
both in \W.
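
A small sketch of the point above, written with Python's re module as a stand-in for the Perl rule (for this pattern the \b semantics are the same as far as I know). The subjects and the helper name dollar_before_word are made up for illustration; they are not from the original report.

```python
import re

# Same pattern as the rule: a literal "$" followed by a word boundary.
PATTERN = re.compile(r"\$\b")

def dollar_before_word(subject):
    """True if a '$' is immediately followed by a word character."""
    return PATTERN.search(subject) is not None

# "$" and " " are both non-word characters, so there is no boundary between them.
print(dollar_before_word("Save $ today"))   # False
# "$" is a non-word character and "5" is a word character, so \b matches there.
print(dollar_before_word("Save $5 today"))  # True
# At the end of the string \b only matches after a word character.
print(dollar_before_word("Save $"))         # False
```

So /\$\b/ effectively means "a $ directly followed by a word character", not "a $ at the end of a word".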

Thanks, I'd forgotten what the definition of a boundary was.


I presume that, until SpamAssassin gets full Unicode support,
non-ASCII characters are seen as one or more \W characters.
So:

    "  Ångström  "

would have boundaries at the points marked by "|"

   " Å|ngstr|ö|m| "

split into several words and without a boundary before the Å.
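
The marked string above can be reproduced with a short sketch, assuming ASCII-only \w. It uses Python rather than Perl, with re.ASCII playing roughly the role of Perl's /a, and a second variant that matches against the raw UTF-8 bytes, where each non-ASCII character becomes two non-word bytes. The function names are hypothetical.

```python
import re

def mark_boundaries(text):
    """Insert '|' at every \\b position, with \\w limited to [a-zA-Z0-9_]."""
    return re.sub(r"\b", "|", text, flags=re.ASCII)

def mark_boundaries_utf8(text):
    """Same idea, but matching against the UTF-8 bytes of the text."""
    raw = text.encode("utf-8")
    # Bytes patterns in Python always use ASCII rules for \w and \b.
    marked = re.sub(rb"\b", b"|", raw)
    return marked.decode("utf-8")

sample = " \u00c5ngstr\u00f6m "  # " Ångström " with precomposed characters

print(mark_boundaries(sample))       # " Å|ngstr|ö|m| "
print(mark_boundaries_utf8(sample))  # " Å|ngstr|ö|m| "
```

Both variants give the same picture: no boundary before the Å, and the word is split around the ö.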

Possibly. Perl's documentation indicates it would work that way if /a is in effect. Otherwise (http://perldoc.perl.org/perlrecharclass.html#Word-characters):


For code points above 255 ...

\w matches the same as \p{Word} matches in this range. That is, it matches Thai letters, Greek letters, etc. This includes connector punctuation (like the underscore) which connect two words together, or diacritics, such as a COMBINING TILDE and the modifier letters, which are generally used to add auxiliary markings to letters.

For code points below 256 ...
    if locale rules are in effect ...

        \w matches the platform's native underscore character plus whatever the locale considers to be alphanumeric.

    if Unicode rules are in effect ...
        \w matches exactly what \p{Word} matches.
    otherwise ...
        \w matches [a-zA-Z0-9_].
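
To make the difference between the rule sets concrete, here is a rough sketch that checks single characters against \w under Unicode rules and under ASCII-only rules. It is Python, not Perl, so it is only an approximation: Python's Unicode \w is not guaranteed to agree with \p{Word} in every case, and Python has no locale branch here. The helper is_word_char and the sample characters are my own choices.

```python
import re

def is_word_char(ch, ascii_only=False):
    """True if the single character ch matches \\w under the chosen rules."""
    flags = re.ASCII if ascii_only else 0
    return re.fullmatch(r"\w", ch, flags) is not None

samples = {
    "a": "ASCII letter",
    "_": "underscore",
    "\u00c5": "A with ring above, code point 197 (below 256)",
    "\u03b1": "Greek alpha, code point 945 (above 255)",
    "\u0e01": "Thai letter ko kai (above 255)",
    "$": "dollar sign",
}

for ch, description in samples.items():
    print(description, is_word_char(ch), is_word_char(ch, ascii_only=True))
```

Under the ASCII-only rules only "a" and "_" count as word characters; under the Unicode rules the accented, Greek and Thai letters do as well, and "$" is a non-word character either way.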

It looks like the "Word" property might be a Perl extension to Unicode (or at least it's very hard to google), so that's as far as my digging can go into the precise semantics of \w.
