Amir Caspi <[email protected]> wrote on 01/29/2014 11:08:18 AM:
> From: Amir Caspi <[email protected]>
> To: Andy Jezierski <[email protected]>,
> Cc: "[email protected]" <[email protected]>
> Date: 01/29/2014 11:08 AM
> Subject: Re: Help with a regex to catch spam with gibberish html tags
>
> On Jan 29, 2014, at 9:53 AM, "Andy Jezierski" <[email protected]>
> wrote:
> I've been noticing a lot of spam getting through with the same
> traits, a bunch of random words within brackets. They all seem to
> come after the </body> or the </html> tag. Anyone much more
> knowledgeable than me care to assist with a rule to detect them?
>
> What about something like:
>
> rawbody HTML_NONSENSE_TAGS /(?:<[A-Za-z0-9]{4,}>\s*){10,}/
>
> This will hit on 10 or more consecutive tags separated by nothing
> but white space. Only single-word tags will hit, so this should
> minimize FPs from heavy formatting such as nested divs.
>
> Completely untested, use at your own risk (but post back and tell us
> how well it worked).
>
> --- Amir
> thumbed via iPhone
That rule seems to be working fine. It has hit on every one of those pesky
messages so far with no FPs. I'll let it run for a while longer before I
bump up the score.
Thanks
Andy
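
For anyone who wants to poke at the pattern outside SpamAssassin before trusting it, here is a rough Python sketch of the same regex. The sample strings and the helper name looks_like_nonsense_tags are made up for illustration, and it does not model how SpamAssassin splits the raw body before matching, so treat it as a sanity check of the pattern only.

```python
import re

# Same pattern as the rule above, written as a Python regex.
# Assumption: the rule's regex is closed right after {10,}.
NONSENSE_TAGS = re.compile(r"(?:<[A-Za-z0-9]{4,}>\s*){10,}")


def looks_like_nonsense_tags(raw_html: str) -> bool:
    """Return True if 10+ consecutive single-word tags appear in raw_html."""
    return NONSENSE_TAGS.search(raw_html) is not None


# Hypothetical samples, not taken from real messages.
words = ["apple", "river", "stone", "cloud", "tiger",
         "maple", "ocean", "pearl", "frost", "ember"]

# Ten gibberish tags after the closing tags, one per line.
spam_tail = "</body></html>\n" + "\n".join(f"<{w}>" for w in words)

# Heavy but legitimate formatting: tags with attributes and closing tags.
nested_divs = '<div class="a">' * 12 + "text" + "</div>" * 12

# Only nine gibberish tags, one short of the threshold.
nine_tags = " ".join(f"<{w}>" for w in words[:9])

print(looks_like_nonsense_tags(spam_tail))
print(looks_like_nonsense_tags(nested_divs))
print(looks_like_nonsense_tags(nine_tags))
```

One thing the sketch makes easy to check: a run of ten or more attribute-free tags of four or more characters (say, <span> repeated) would also match, so that is the kind of FP to watch for while the rule is on a low score.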