I noticed that VERY_SUSP_RECIPS and VERY_SUSP_CC_RECIPS were failing to
match in some cases where they should, and matching in some where they
shouldn't.  The current pattern is:

/\b([a-z][a-z])[^@]{0,20}(@[-a-z0-9_\.]{0,30}).{0,30}?(?:\1[^@]*\2.{0,20}?){9,}/is

- Sequences such as "[EMAIL PROTECTED], [EMAIL PROTECTED], [EMAIL PROTECTED], ..." matched.
(This should match SUSPICIOUS_RECIPS, but not VERY_SUSP_RECIPS.)  I think
both occurrences of [^@] should be [^@,] to prevent it from swallowing
commas and usernames as part of the hostname, and then mistaking a hostname
for a username.  Possibly it should also exclude parens and angle brackets.

- I added \b before \1 to keep it from finding the repeated two-character
sequence anywhere other than at the beginning of the username.

- Long hostnames caused failures.  I changed \2.{0,20}? to \2.{0,30}? to
allow for them.  Obviously that could be better.

- I saw a number of spams with 8 or 9 repetitions, so I'm now using {7,}
instead of {9,}.  (If/when rules can have variable scores, a possibly
worthwhile enhancement would be to make this score proportional to the
number of repetitions.)

The result is:

/\b([a-z][a-z])[^@,]{0,20}(@[-a-z0-9_\.]{0,30}).{0,30}?(?:\b\1[^@,]*\2.{0,30}?){7,}/i
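
As a rough check of the changes above, here is a small Python sketch that
runs both patterns against made-up recipient lists.  Python's re module
accepts the same syntax for these two patterns; the addresses, the
example.com domain, and the recipients() helper are assumptions for
illustration only, not taken from real spam or from SpamAssassin.

```python
import re

# Patterns transcribed from this message.  Perl's /is becomes re.I | re.S
# and /i becomes re.I.
OLD = re.compile(
    r"\b([a-z][a-z])[^@]{0,20}(@[-a-z0-9_\.]{0,30}).{0,30}?"
    r"(?:\1[^@]*\2.{0,20}?){9,}",
    re.I | re.S,
)
NEW = re.compile(
    r"\b([a-z][a-z])[^@,]{0,20}(@[-a-z0-9_\.]{0,30}).{0,30}?"
    r"(?:\b\1[^@,]*\2.{0,30}?){7,}",
    re.I,
)


def recipients(names, domain="example.com"):
    """Hypothetical helper: build a comma-separated recipient list."""
    return ", ".join(f"{name}@{domain}" for name in names)


# Eight and seven addresses whose usernames share the prefix "bo".
similar_8 = recipients(f"bob{i}" for i in range(1, 9))
similar_7 = recipients(f"bob{i}" for i in range(1, 8))

# Twelve addresses with unrelated usernames at one domain.
dissimilar_12 = recipients([
    "alice", "bob", "carol", "dave", "erin", "frank",
    "grace", "heidi", "ivan", "judy", "mallory", "nick",
])

results = {
    label: (bool(OLD.search(text)), bool(NEW.search(text)))
    for label, text in [
        ("similar_8", similar_8),
        ("similar_7", similar_7),
        ("dissimilar_12", dissimilar_12),
    ]
}

for label, (old_hit, new_hit) in results.items():
    print(f"{label:14} old={old_hit!s:5} new={new_hit}")
```

On these samples the old pattern hits the unrelated list (it can take "ex"
from "example" as the repeated pair) and misses the eight similar
addresses, while the revised pattern does the opposite and still ignores
a list of only seven.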

This fixes both problems and works on all my tests, but I'm not 100%
confident I haven't broken something.  I'm assuming the intent was that
"similar usernames" means "similar initial substrings".  If not, I've
certainly broken something, but the original was matching lists that had
no real similarity in the usernames.
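
To make the "similar initial substrings" assumption concrete, this Python
sketch compares the revised pattern with a variant that lacks the \b
before \1.  The variant exists only for this comparison, and the usernames
are invented.

```python
import re

# Shared first half of the revised pattern from this message.
HEAD = r"\b([a-z][a-z])[^@,]{0,20}(@[-a-z0-9_\.]{0,30}).{0,30}?"

# The revised pattern: the repeated pair must start at a word boundary.
WITH_BOUNDARY = re.compile(HEAD + r"(?:\b\1[^@,]*\2.{0,30}?){7,}", re.I)

# Comparison-only variant (an assumption, not a proposed rule): the pair
# may repeat anywhere inside a later username.
WITHOUT_BOUNDARY = re.compile(HEAD + r"(?:\1[^@,]*\2.{0,30}?){7,}", re.I)


def recipients(names, domain="example.com"):
    """Hypothetical helper: build a comma-separated recipient list."""
    return ", ".join(f"{name}@{domain}" for name in names)


# "an" starts the first username but only ends the others.
shared_inside = recipients(
    ["ann", "joan", "dan", "stan", "ryan", "evan", "jean", "alan"]
)

# "an" starts every username.
shared_prefix = recipients(f"ann{i}" for i in range(1, 9))

inside_hits = (
    bool(WITHOUT_BOUNDARY.search(shared_inside)),
    bool(WITH_BOUNDARY.search(shared_inside)),
)
prefix_hits = (
    bool(WITHOUT_BOUNDARY.search(shared_prefix)),
    bool(WITH_BOUNDARY.search(shared_prefix)),
)

print("shared_inside  (without \\b, with \\b):", inside_hits)
print("shared_prefix  (without \\b, with \\b):", prefix_hits)
```

With \b in place, only the list whose usernames all begin with the same
two characters is flagged.  If the original intent was to catch a shared
substring anywhere in the username, the first list is the kind of case the
revised rule no longer matches.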

Tom

_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
