[Bug 3191] Word boundaries are lost after HTML processing

bugzilla-daemon 18 Mar 2004 19:40:51 -0000

http://bugzilla.spamassassin.org/show_bug.cgi?id=3191






------- Additional Comments From [EMAIL PROTECTED]  2004-03-18 11:40 -------
This one is easy to explain: a word boundary (\b) is the position between a
word char [\w] and a non-word char [\W], in either order... so \w\W or \W\w.
Without locale or Unicode semantics, an accented character is NOT part of the
\w class, therefore an accented letter followed by a space (e.g. "\xE9 ") is
\W\W and doesn't count as "\w\W".
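
To make the point concrete, here is a minimal sketch in Python rather than Perl. It assumes byte-oriented matching (a Python bytes pattern keeps \w ASCII-only), and the sample strings and the 0xF2 byte (Latin-1 o-grave) are illustrative choices, not taken from the bug report:

```python
import re

# Bytes patterns keep \w ASCII-only, so 0xF2 (Latin-1 o-grave) counts as \W.
word_end = re.compile(rb"fo[o\xF2]\b")

plain = word_end.search(b"foo bar")        # "o" then " " is \w\W: boundary found
accented = word_end.search(b"fo\xF2 bar")  # 0xF2 then " " is \W\W: no boundary

print(plain is not None, accented is not None)  # prints: True False
```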

I worked around this in my obfu rule generator (http://sandgnat.com/cmos/) by
using an "or grouping" when matching word boundaries.  See how the rules
generated by http://sandgnat.com/cmos/cmos.jsp?words=foo (which is based on the
regexp /\bfoo\b/) have this pattern embedded at the very end: 

(?:[o0]\b|(?:[\*\xB0\xBA\xD8\xF8\xD2-\xD6\xF2-\xF6]|\(\)|\[\]|\xC5[\x8C-\x91]|\xC6[\xA0-\xA1]|\xC7[\x91-\x92]|\xC7[\xBE-\xBF]|\xCE\x8C|\xCE\x98|\xCE\x9F|\xCE\xB8|\xCE\xBF|\xCF\x8C|\xD0\x9E|\xD0\xBE|\xD5\x95)\B)

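That regexp snippet matches the "o\b" part of /\bfoo\b/.  The first alternative
is a plain "o" or "0" followed by \b.  The second alternative covers the
accented versions and also multi-byte characters (which HTML::Entities
generates from &xxx; entities), followed by \B, because those bytes are
themselves non-word chars and the end of the word is then a \W\W position.
By default my script prints the escaped versions, such as "\xB0", instead of
the literal accented values, because some browsers have issues with copying
and pasting them.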
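
A shortened version of that "or grouping" can be tried out as follows. This is only a sketch in Python with bytes patterns, standing in for the Perl rules; the name foo_rule, the helper hits(), and the reduced character set are hypothetical and much smaller than what the generator emits:

```python
import re

# Hypothetical, shortened version of the generated pattern:
# a plain o/0 must be followed by \b, while accented single bytes and one
# two-byte UTF-8 group (0xC5 0x8C-0x91) must be followed by \B.
foo_rule = re.compile(
    rb"\bfo(?:[o0]\b|(?:[\xD2-\xD6\xF2-\xF6]|\xC5[\x8C-\x91])\B)"
)

def hits(data: bytes) -> bool:
    """Return True if the shortened rule matches anywhere in data."""
    return foo_rule.search(data) is not None

for sample in (b"foo bar", b"fo\xF2 bar", b"fo\xC5\x8D bar", b"food"):
    print(sample, hits(sample))
```

The \B branch is what lets the accented endings match, while "food" is still rejected because the final "o" is followed by another word char.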



------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
