Re: Help with a regex to catch spam with gibberish html tags

Andy Jezierski Thu, 30 Jan 2014 08:31:29 -0800

Amir Caspi <[email protected]> wrote on 01/29/2014 11:08:18 AM:

> From: Amir Caspi <[email protected]>
> To: Andy Jezierski <[email protected]>, 
> Cc: "[email protected]" <[email protected]>
> Date: 01/29/2014 11:08 AM
> Subject: Re: Help with a regex to catch spam with gibberish html tags
> 
> On Jan 29, 2014, at 9:53 AM, "Andy Jezierski" <[email protected]> wrote:


>> I've been noticing a lot of spam getting through with the same 
>> traits, a bunch of random words within brackets.  They all seem to 
>> come after the </body> or the </html> tag.  Anyone much more 
>> knowledgeable than me care to assist with a rule to detect them?
> 
> What about something like:
> 
> rawbody HTML_NONSENSE_TAGS /(?:<[A-Za-z0-9]{4,}>\s*){10,}/
> 
> This will hit on 10 or more consecutive tags separated by nothing 
> but whitespace. Only bare single-word tags (four or more 
> alphanumeric characters, no attributes or slashes) will hit, so this 
> should minimize FPs from heavy formatting such as nested divs.
> 
> Completely untested, use at your own risk (but post back and tell us
> how well it worked).
> 
> --- Amir
> thumbed via iPhone

That rule seems to be working fine. It has hit on every one of those pesky 
messages so far with no FPs. Will let it run for a while longer before I 
bump up the score.
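
For anyone who wants to sanity-check the pattern outside SpamAssassin first, here is a minimal Python sketch. It only exercises the regex itself, not how SpamAssassin decodes or splits the raw body, and it assumes Python's re module treats this particular pattern the same way Perl does. The sample strings and the hits() helper are made up for illustration.

```python
import re

# Same pattern as the rawbody rule quoted above, without the / delimiters.
NONSENSE_TAGS = re.compile(r"(?:<[A-Za-z0-9]{4,}>\s*){10,}")


def hits(raw_body: str) -> bool:
    """Return True if the text contains 10+ consecutive bare single-word tags."""
    return NONSENSE_TAGS.search(raw_body) is not None


# Hypothetical word list standing in for the gibberish seen in the spam.
words = ["apple", "river", "stone", "cloud", "table",
         "green", "horse", "paper", "light", "music"]

# Made-up spam sample: ten bracketed words after the closing tags.
spam = "</body></html>\n" + " ".join(f"<{w}>" for w in words)

# Made-up near miss: only nine bracketed words, below the {10,} threshold.
near_miss = " ".join(f"<{w}>" for w in words[:9])

# Made-up legitimate markup: attributes, closing tags, and text break the run.
ham = '<div class="a"><div class="b"><span>hello</span></div></div>'

print(hits(spam))       # True
print(hits(near_miss))  # False
print(hits(ham))        # False
```

Lowering the {10,} count makes the rule more aggressive, at the cost of a higher chance of FPs on legitimate mail.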

Thanks
Andy
