Re: Regex problem

RW Mon, 28 Mar 2016 09:59:00 -0700

On Mon, 28 Mar 2016 12:21:17 -0400
Joe Quinn wrote:

> On 3/28/2016 11:59 AM, RW wrote:
> > On Mon, 28 Mar 2016 09:58:23 -0400
> > Joe Quinn wrote:
> >  
> >> On 3/28/2016 9:55 AM, RW wrote:  
> >>>     Subject =~ /\$\b/  
> >> There's no word boundary between the $ and the ' ' because they're
> >> both in \W.  
> > Thanks, I'd forgotten what the definition of a boundary was.
> >
> >
> > I presume that, until spamassassin gets full unicode support,
> > non-ascii characters are seen as one or more \W characters.
> > So:
> >
> >     "  Ångström  "
> >
> > would have boundaries at the points marked by "|"
> >
> >    " Å|ngstr|ö|m| "
> >
> > split into several words and without a boundary before the Å.  
> Possibly. Perl's documentation indicates it would work that way if /a
> is in effect. Otherwise 
> (http://perldoc.perl.org/perlrecharclass.html#Word-characters):
> 
> For code points above 255 ...
>      \w matches the same as \p{Word} matches in this range. That is,
> it matches Thai letters, Greek letters, etc. This includes connector 
> punctuation (like the underscore) which connect two words together,
> or diacritics, such as a COMBINING TILDE and the modifier letters,
> which are generally used to add auxiliary markings to letters.
> For code points below 256 ...


My understanding is that SA works with individual bytes, so with UTF-8
input it sees each byte of a multi-byte sequence separately and knows
nothing about the nature of code points above 127.
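
As a rough illustration of the byte-versus-character difference, here is a
minimal sketch in Python rather than Perl. It is not SpamAssassin code, and
the helper name boundary_offsets is made up for this example. It assumes only
that Python's bytes patterns treat \w as ASCII-only, which mirrors the
byte-wise behaviour described above, while str patterns are Unicode-aware.

```python
import re

# Hypothetical helper for this illustration: list the offsets at which
# the zero-width \b assertion matches in a subject (bytes or str).
def boundary_offsets(subject):
    pattern = rb"\b" if isinstance(subject, bytes) else r"\b"
    return [m.start() for m in re.finditer(pattern, subject)]

text = "  Ångström  "
raw = text.encode("utf-8")  # Å and ö each become two non-ASCII bytes

# Byte-wise: the bytes of Å and ö count as non-word characters, so the
# word is split up and there is no boundary before the Å.
byte_offsets = boundary_offsets(raw)

# Character-wise: Å and ö are word characters, so the only boundaries
# are at the start and end of the whole word.
char_offsets = boundary_offsets(text)

# The original rule: "$" followed by a space has no boundary between
# them (both are non-word), while "$" followed by a digit does.
dollar = re.compile(rb"\$\b")
hit_before_space = dollar.search(b"Win $ today") is not None
hit_before_digit = dollar.search(b"Win $5 today") is not None

print(byte_offsets, char_offsets, hit_before_space, hit_before_digit)
```

The byte offsets fall after the Å, after "ngstr", after the ö and after the
"m", which is the same picture as the "|" marks in the quoted example.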
