On Mon, 28 Mar 2016 12:21:17 -0400
Joe Quinn wrote:

> On 3/28/2016 11:59 AM, RW wrote:
> > On Mon, 28 Mar 2016 09:58:23 -0400
> > Joe Quinn wrote:
> >
> >> On 3/28/2016 9:55 AM, RW wrote:
> >>> Subject =~ /\$\b/
> >> There's no word boundary between the $ and the ' ' because they're
> >> both in \W.
> > Thanks, I'd forgotten what the definition of a boundary was.
> >
> >
> > I presume that, until spamassassin gets full unicode support,
> > non-ascii characters are seen as one or more \W characters.
> > So:
> >
> > " Ångström "
> >
> > would have boundaries at the points marked by "|"
> >
> > " Å|ngstr|ö|m| "
> >
> > split into several words and without a boundary before the Å.
> Possibly. Perl's documentation indicates it would work that way if /a
> is in effect. Otherwise
> (http://perldoc.perl.org/perlrecharclass.html#Word-characters):
>
> For code points above 255 ...
> \w matches the same as \p{Word} matches in this range. That is,
> it matches Thai letters, Greek letters, etc. This includes connector
> punctuation (like the underscore) which connect two words together,
> or diacritics, such as a COMBINING TILDE and the modifier letters,
> which are generally used to add auxiliary markings to letters.
> For code points below 256 ...

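My understanding is that SA matches rules against individual bytes, so
with UTF-8 input it wouldn't understand anything about the nature of
code points above 127. Each byte of a multi-byte sequence would just be
a \W byte, which is what gives the boundaries I marked.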
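
To illustrate the byte-wise view, here is a minimal sketch. It uses Python's re module on bytes objects as a stand-in, on the assumption that this resembles how a byte-oriented matcher behaves; it is not Perl and not SA's actual code. For bytes patterns Python's \w is ASCII-only, so every byte above 127 counts as \W. The helper names boundaries() and mark() are made up for the example.

```python
import re

def boundaries(text):
    # Offsets where \b matches. For bytes input, \w is ASCII-only,
    # so every byte above 127 is treated as \W (assumed to mirror a
    # byte-oriented matcher; this is not SpamAssassin code).
    pattern = rb"\b" if isinstance(text, bytes) else r"\b"
    return [m.start() for m in re.finditer(pattern, text)]

def mark(data, offsets):
    # Insert "|" at each byte offset, then decode for display.
    out = bytearray()
    prev = 0
    for off in offsets:
        out += data[prev:off] + b"|"
        prev = off
    out += data[prev:]
    return out.decode("utf-8")

word = " Ångström "
raw = word.encode("utf-8")

# Byte-wise: Å and ö are two \W bytes each, so the word is split.
print(mark(raw, boundaries(raw)))   # " Å|ngstr|ö|m| "

# Unicode-aware: Å and ö are word characters, one word, two boundaries.
print(boundaries(word))             # [1, 9]

# The original rule: no boundary between "$" and " ", both are \W.
print(re.search(rb"\$\b", b"Win $ today"))     # None
print(re.search(rb"\$\b", b"Win $100 today"))  # a match object
```

The byte-wise result reproduces the " Å|ngstr|ö|m| " split from the quoted message, while the Unicode-aware match sees a single word with a boundary before the Å.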