Re: Help with a regex to catch spam with gibberish html tags

Kevin A. McGrail Thu, 30 Jan 2014 09:29:54 -0800

On 1/30/2014 11:23 AM, Andy Jezierski wrote:

Amir Caspi <[email protected]> wrote on 01/29/2014 11:08:18 AM:


> From: Amir Caspi <[email protected]>
> To: Andy Jezierski <[email protected]>,
> Cc: "[email protected]" <[email protected]>
> Date: 01/29/2014 11:08 AM
> Subject: Re: Help with a regex to catch spam with gibberish html tags
>

> On Jan 29, 2014, at 9:53 AM, "Andy Jezierski"<[email protected]> wrote:


> I've been noticing a lot of spam getting through with the same
> traits, a bunch of random words within brackets.  They all seem to
> come after the </body> or the </html> tag.  Anyone much more
> knowledgeable than me care to assist with a rule to detect them?
>
> What about something like:
>
> rawbody HTML_NONSENSE_TAGS /(?:<[A-Za-z0-9]{4,}>\s*){10,}/
>
> This will hit on 10 or more consecutive tags separated by nothing
> but white space. Only single-word tags will hit, so this should
> minimize FPs from heavy formatting such as nested divs.
>
> Completely untested, use at your own risk (but post back and tell us
> how well it worked).
>
> --- Amir
> thumbed via iPhone
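
For anyone who wants to try the pattern outside SpamAssassin first, here is a minimal Python sketch of the same regex. The sample strings, the word list, and the helper name are made up for illustration, and how SpamAssassin actually splits the raw body before applying a rawbody rule may differ from matching one whole string as done here.

```python
import re

# Same pattern as the rawbody rule above; Python takes it as a plain string,
# so the surrounding / delimiters are not used.
NONSENSE_TAGS = re.compile(r"(?:<[A-Za-z0-9]{4,}>\s*){10,}")

# Hypothetical filler words standing in for the random words seen in the spam.
WORDS = ["lamp", "river", "stone", "cloud", "maple",
         "brick", "tiger", "ocean", "pedal", "quilt"]


def looks_like_nonsense_tags(raw_body: str) -> bool:
    """Return True if raw_body has 10+ consecutive single-word tags."""
    return NONSENSE_TAGS.search(raw_body) is not None


# Ten bracketed words after the closing tags, separated only by whitespace.
spam_tail = "</body></html>\n" + " ".join(f"<{w}>" for w in WORDS)

# Heavy but legitimate formatting: tags with attributes are not single words.
nested_divs = '<div class="a">' * 12

print(looks_like_nonsense_tags(spam_tail))    # ten single-word tags in a row
print(looks_like_nonsense_tags(nested_divs))  # attributes break the match
```

Tags shorter than four characters, such as `<b>` or `<p>`, are skipped by the `{4,}` quantifier, and the closing `</body>` and `</html>` tags never count because `/` is not in the character class.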

That rule seems to be working fine. Has hit on every one of those pesky messages so far with no FP's. Will let it run for a while longer before I bump up the score.

If you want to share the complete rule, I can throw it into my sandbox and see what masscheck thinks as well.


regards,
KAM
