Re: HTML link regex

Alexandre Boyer Thu, 27 Sep 2012 10:48:23 -0700

Alex, from prypiat.
Yes, I recycle.


On 12-09-27 11:09 AM, Bowie Bailey wrote:
> On 9/27/2012 10:41 AM, Alexandre Boyer wrote:
>> Hello all,
>>
>> Here is a small ruleset that I'm working with. I added it to our
>> local ruleset in prod:
>>
>>     # BAD LINKS N-NG ;-) ;
>>     # Canada Post
>>     uri_detail   AJB_CANPOST_BADLINK             raw !~ /canadapost\./
>>     text =~ /(?:https?:\/\/(?:www\.)?|www\.)canadapost\./ type =~ /^a$/
>>     describe     AJB_CANPOST_BADLINK             Found a mismatch
>>     between href and anchored text pretending to link to
>> www.canadapost.ca
>>     score        AJB_CANPOST_BADLINK             1.0
>>     meta         AJB_CANPOST_PHISH_BADTRACKNUM   Z_CANPOST_BADLINK &&
>>     !Z_CANPOST_TRACKNUM
>>     describe     AJB_CANPOST_PHISH_BADTRACKNUM   Mismatch between href
>>     and anchored + unofficial tracking number from CanadaPost
>>     score        AJB_CANPOST_PHISH_BADTRACKNUM   2.0
>>     # youtube
>>     uri_detail   AJB_UTUBE_BADLINK   raw !~ /youtube\./ text =~
>>     /(?:https?:\/\/(?:www\.)?|www\.)youtube\./ type =~ /^a$/
>>     describe     AJB_UTUBE_BADLINK   Found a mismatch between href and
>>     anchored text pretending to link to www.youtube.com
>>     score        AJB_UTUBE_BADLINK   0.5
>>     # because of link trackers (from massmailer for example), we must
>>     meta this with other rulz to be sure we face our fake yutube botnet
>>     meta      AJB_FK_UTUBE_BOTNET     Z_UTUBE_BADLINK && Z_EMPTY_SUBJ
>>     && MIME_HTML_ONLY
>>     describe  AJB_FK_UTUBE_BOTNET     mismatch between href and
>>     anchored + empty subject = botnet
>>     score     AJB_FK_UTUBE_BOTNET     5.5
>>     ##
>>     # TODO: check if we could work with DKIM, exists:List-Unsubscribe,
>>     SPF_PASS, RCVD_IN_RP_SAFE, RCVD_IN_RP_CERTIFIED and others
>>     #    in order to avoid FPs from MassMailers.
>>
>> Note the TODO ;-)
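
For readers who don't use uri_detail every day, here is a rough Python sketch of what the Canada Post rule is meant to test. It is an illustration only, not how SpamAssassin evaluates the rule: the function name is_bad_link and the sample URLs are made up, and it assumes the three conditions (raw, text, type) must all hold for the same link.

```python
import re

# Same regexes as in the rule, written for Python's re module.
HREF_RE = re.compile(r"canadapost\.")
TEXT_RE = re.compile(r"(?:https?:\/\/(?:www\.)?|www\.)canadapost\.")


def is_bad_link(href, anchor_text, tag="a"):
    """Hypothetical helper: True when an <a> link's visible text looks like
    a canadapost URL while the real href does not mention canadapost."""
    return (
        tag == "a"                                   # type =~ /^a$/
        and HREF_RE.search(href) is None             # raw !~ /canadapost\./
        and TEXT_RE.search(anchor_text) is not None  # text =~ /...canadapost\./
    )


# Made-up examples:
print(is_bad_link("http://tracker.example/x", "www.canadapost.ca"))   # mismatch
print(is_bad_link("http://www.canadapost.ca/", "www.canadapost.ca"))  # honest link
```

The second call is the legitimate case; link trackers used by mass mailers behave like the first one, which is why the rule is combined with others in a meta.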
>
> Don't know if it makes much difference in this case, but...
>
> (?:https?:\/\/(?:www\.)?|www\.)

Should catch:
http://
https://
http://www.
https://www.
www.

>
> can be simplified to:
>
> (?:https?:\/\/|www\.)
>

While this catches:
http://
https://
www.

Which covers less. It may be overkill, but my regex has one and only one
purpose: to match any kind of "valid" web link as the common user sees
it (i.e. "as seen on TV").

The spammer will try to lure the common user by mimicking what the common
user is used to seeing, no?
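
A small Python sketch to compare the two variants on the five prefixes listed above. It is only a check under assumptions: it uses Python's re module rather than the Perl engine SpamAssassin runs on, and the helper names prefix_alone and in_rule are invented for this example. It tests each prefix in two ways: on its own (the whole prefix must be consumed), and inside the unanchored rule regex followed by canadapost\.

```python
import re

# The two prefix variants from the thread.
LONG = r"(?:https?:\/\/(?:www\.)?|www\.)"
SHORT = r"(?:https?:\/\/|www\.)"

PREFIXES = ["http://", "https://", "http://www.", "https://www.", "www."]


def prefix_alone(pattern, prefix):
    """True if the pattern, by itself, consumes the whole prefix."""
    return re.fullmatch(pattern, prefix) is not None


def in_rule(pattern, prefix):
    """True if the unanchored rule regex hits somewhere in prefix + domain."""
    return re.search(pattern + r"canadapost\.", prefix + "canadapost.ca") is not None


for p in PREFIXES:
    print(f"{p:14}"
          f" alone: long={prefix_alone(LONG, p)!s:5} short={prefix_alone(SHORT, p)!s:5}"
          f" in rule: long={in_rule(LONG, p)!s:5} short={in_rule(SHORT, p)}")
```

Taken alone, the short variant does not consume "http://www." or "https://www." in full. Inside the unanchored rule, it still hits on every prefix, because the search can start at "www.".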

> Since you're not anchoring the front of the regexp or trying to
> capture the match, the results will be the same.
>

I'm not capturing because I don't use the match afterwards. On a small
system this makes no difference. On large systems (millions of emails
filtered a day) it probably does. That is a guess; I don't want to
prove it on my own systems :-)
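
To show what "not capturing" means, a minimal sketch in Python (again an assumption: Python's re, not Perl, and no timing is measured here). Both patterns match the same text; the only difference is whether the engine has to remember the sub-matches. The sample sentence is made up.

```python
import re

# Same youtube pattern, once with capturing groups, once with (?:...) groups.
capturing = re.compile(r"(https?:\/\/(www\.)?|www\.)youtube\.")
non_capturing = re.compile(r"(?:https?:\/\/(?:www\.)?|www\.)youtube\.")

text = "Watch it at http://www.youtube.com/ today"  # made-up sample

m_cap = capturing.search(text)
m_non = non_capturing.search(text)

# Identical overall match, different bookkeeping.
print(m_cap.group(0), capturing.groups, m_cap.groups())
print(m_non.group(0), non_capturing.groups, m_non.groups())
```

Whether the saved bookkeeping is measurable at mail-filter volumes is exactly the part I have not tested.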

Alex.
