On 3/28/2016 11:59 AM, RW wrote:
On Mon, 28 Mar 2016 09:58:23 -0400
Joe Quinn wrote:

On 3/28/2016 9:55 AM, RW wrote:
    Subject =~ /\$\b/
There's no word boundary between the $ and the ' ' because they're
both in \W.
Thanks, I'd forgotten what the definition of a boundary was.
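
A quick way to check this is to run the pattern against a few subjects. The sketch below uses Python's `re` module rather than Perl, on the assumption that `\b` behaves the same way for these ASCII-only inputs. The helper name `dollar_then_boundary` and the sample subjects are made up for illustration.

```python
import re

# The pattern from the rule under discussion: a literal "$" followed by \b.
PATTERN = re.compile(r"\$\b")


def dollar_then_boundary(subject: str) -> bool:
    """Return True if some "$" is immediately followed by a word boundary."""
    return PATTERN.search(subject) is not None


if __name__ == "__main__":
    for subject in ("Win $100 now", "Win $ now", "Win $"):
        print(repr(subject), dollar_then_boundary(subject))
```

Only the first subject should match: "$" and "1" sit on opposite sides of the \w/\W split, whereas "$" followed by a space or by the end of the string does not.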


I presume that, until SpamAssassin gets full Unicode support,
non-ASCII characters are seen as one or more \W characters.
So:

    "  Ångström  "

would have boundaries at the points marked by "|"

   " Å|ngstr|ö|m| "

split into several words and without a boundary before the Å.
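
The boundary positions above can be reproduced with a small sketch. It is written in Python rather than Perl, on the assumption that Python's bytes patterns (where \w is ASCII-only) are a fair stand-in for a regex engine matching raw UTF-8 bytes without Unicode rules. The function name `mark_boundaries` is hypothetical.

```python
import re


def mark_boundaries(text: str, as_bytes: bool) -> str:
    """Insert "|" at every position where a word boundary matches."""
    if as_bytes:
        # Bytes patterns use ASCII-only \w, so every byte >= 0x80 counts as \W.
        # No boundary can fall inside a multi-byte UTF-8 sequence, because
        # all of its bytes are \W, so decoding the result is safe.
        marked = re.sub(rb"\b", b"|", text.encode("utf-8"))
        return marked.decode("utf-8")
    # str patterns use Unicode-aware \w by default.
    return re.sub(r"\b", "|", text)


# Precomposed U+00C5 and U+00F6, written as escapes to avoid encoding surprises.
SAMPLE = " \u00c5ngstr\u00f6m "

if __name__ == "__main__":
    print(repr(mark_boundaries(SAMPLE, as_bytes=True)))
    print(repr(mark_boundaries(SAMPLE, as_bytes=False)))
```

With byte-wise matching the word is split exactly as marked above. With Unicode-aware matching the only boundaries are at the start and end of the whole word.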
Possibly. Perl's documentation indicates it would work that way if the /a modifier is in effect. Otherwise, the rules are as follows (http://perldoc.perl.org/perlrecharclass.html#Word-characters):

For code points above 255 ...
    \w matches the same as \p{Word} matches in this range. That is, it matches Thai letters, Greek letters, etc. This includes connector punctuation (like the underscore) which connect two words together, or diacritics, such as a COMBINING TILDE and the modifier letters, which are generally used to add auxiliary markings to letters.
For code points below 256 ...
    if locale rules are in effect ...
        \w matches the platform's native underscore character plus whatever the locale considers to be alphanumeric.
    if Unicode rules are in effect ...
        \w matches exactly what \p{Word} matches.
    otherwise ...
        \w matches [a-zA-Z0-9_].
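
The difference between the ASCII-only and Unicode cases can be sketched roughly as follows. This uses Python's `re.ASCII` flag as a stand-in for Perl's /a, which is an assumption: Python does not define its Unicode \w in terms of \p{Word}, so the two engines may disagree on edge cases such as combining marks. The helper `is_word_char` and the sample characters are chosen for illustration only.

```python
import re


def is_word_char(ch: str, ascii_only: bool) -> bool:
    """Return True if the single character ch matches \\w under the given rules."""
    flags = re.ASCII if ascii_only else 0
    return re.fullmatch(r"\w", ch, flags) is not None


# "a", "_", "$", then A WITH RING ABOVE, GREEK SMALL LAMDA, THAI KO KAI.
SAMPLES = ["a", "_", "$", "\u00c5", "\u03bb", "\u0e01"]

if __name__ == "__main__":
    for ch in SAMPLES:
        print(f"U+{ord(ch):04X}",
              "ascii:", is_word_char(ch, ascii_only=True),
              "unicode:", is_word_char(ch, ascii_only=False))
```

Under the ASCII-only rules just the letter and the underscore count as word characters. Under Unicode rules the accented, Greek, and Thai letters do as well.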

It looks like the "Word" property might be a Perl extension to Unicode (or at least it's very hard to google), so that's as far as my digging can go into the precise semantics of \w.
