Hi,
I've posted recently on this subject and got some good responses. But I see a problem with SA that should probably be fixed.

When checking URIs, SA looks at both anchor tags and the body text. That's fine. For example, <a href="joe.com">Joe</a> is checked, and http://joe.com is also checked if it appears in the body outside an anchor tag.

The problem arises when a URI in the body is followed by extraneous characters such as >,.;:#*!?)] Some of this is spammy nonsense, but sometimes it's legitimate punctuation. Either way, SA doesn't end the URI before the extraneous trailing character. To work around this, I've written some tests that some may find useful. Please feel free to point out any flaws or inefficiencies.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

uri __LINK_SUSPECT_CJ m{^(?>http://[^/,";'\s:;?!)\#>\]]*)(?<!com|net|org|gov|edu|mil|\.us)}i
uri __LINK_DOT_CJ m{^(?>http://[^/,";'\s:;?!)\#>\]]*)(?<=com\.|net\.|org\.|gov\.|edu\.|mil\.|\.us\.)}i

meta LINK_FOREIGN_CJ ( __LINK_SUSPECT_CJ && ! __LINK_DOT_CJ )
score LINK_FOREIGN_CJ <score>

uri __MAIL_SUSPECT_CJ m{^(?>mailto:[^\s,>?!\#);:'"\]]*)(?<!com|net|org|gov|\.us|edu|mil)}i
uri __MAIL_DOT_CJ m{^(?>mailto:[^\s,>?!\#);:'"\]]*)(?<=com\.|net\.|org\.|gov\.|\.us\.|edu\.|mil\.)}i

meta MAIL_FOREIGN_CJ ( __MAIL_SUSPECT_CJ && ! __MAIL_DOT_CJ )
score MAIL_FOREIGN_CJ <score>

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

The DOT tests are needed for a case like "http://joe.com." where a sentence-ending period gets picked up as part of the URI. Interestingly, SA's uri tests assume the mailto: and http: prefixes for URIs found in the body, so the prefixes are not strictly necessary in these rules, but including them doesn't change the result. Thus http://www.joe.za in the body will match, as will www.joe.za, but obviously a bare joe.za doesn't match at all or we'd be in big trouble. Along the same lines, [EMAIL PROTECTED] matches in the body.
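
For anyone who wants to poke at the logic outside SA, here is a rough Python sketch of how the SUSPECT and DOT tests combine for the http rules. It is only an approximation of what SA does: the function name link_foreign is made up for illustration, the input is assumed to be a URI string already extracted from the message, and because older Python versions lack atomic groups, (?>...) is emulated with a capturing lookahead plus a backreference.

```python
import re

# Host part: everything up to the first "ending" character, taken from
# the character class used in the rules above.
HOST = r"""[^/,";'\s:;?!)#>\]]*"""

def _atomic_prefix(scheme):
    # Emulates the Perl atomic group (?>scheme HOST): the lookahead grabs
    # the longest prefix, and \1 consumes it without allowing backtracking.
    return r"^(?=(" + scheme + HOST + r"))\1"

# Prefix does NOT end in one of the listed TLDs.
SUSPECT = re.compile(
    _atomic_prefix("http://") + r"(?<!com|net|org|gov|edu|mil|\.us)", re.I)

# Prefix ends in a listed TLD followed by a stray trailing dot.
DOT = re.compile(
    _atomic_prefix("http://") + r"(?<=com\.|net\.|org\.|gov\.|edu\.|mil\.|\.us\.)", re.I)

def link_foreign(uri):
    """Hypothetical stand-in for: meta ( __LINK_SUSPECT_CJ && ! __LINK_DOT_CJ )."""
    return bool(SUSPECT.search(uri)) and not DOT.search(uri)

if __name__ == "__main__":
    for uri in ("http://joe.com", "http://joe.com.", "http://joe.com)", "http://joe.za"):
        print(uri, link_foreign(uri))
```

In this sketch only http://joe.za is flagged. http://joe.com. trips the SUSPECT test because of the trailing period, but the DOT test cancels it, which is the whole reason the DOT test exists.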

NOTE: I am not recommending these rules to everyone, nor do we need test runs for match efficiency. I do know that in some situations (like mine), where 99% of email not from these top-level domains is spam, these tests can be extremely valuable. They are especially useful against spam that is very hard to tag, like those silly Nigerian-style body texts or very short text-only spam like this:

You know you want it. http://joe.za

Hope this is useful and thanks for the comments,
Craig Jackson
