On 1/30/2014 1:25 PM, John Hardin wrote:
On Thu, 30 Jan 2014, Amir Caspi wrote:
On Jan 30, 2014, at 10:28 AM, Kevin A. McGrail <kmcgr...@pccc.com> wrote:
If you want to share the complete rule, I can throw it into my
sandbox and see what masscheck thinks as well.
The complete rule would be something like this, assuming Andy
implemented it as I wrote it:
rawbody HTML_NONSENSE_TAGS /(?:<[A-Za-z0-9]{4,}>\s*){10,}/
describe HTML_NONSENSE_TAGS Many consecutive multi-letter HTML tags, likely nonsense/spam
score HTML_NONSENSE_TAGS 0.001
Score to be adjusted as needed, of course.
I'd suggest writing it as a subrule first, to see how well it performs
against the masscheck corpora. If it does well by itself (good hits,
high S/O), then a meta can be added to expose it for scoring. If it
hits a lot but the S/O ratio is low, then it could be analyzed for
possible combinations with other rules to get something that performs
well.
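A minimal sketch of the subrule approach suggested above, assuming the same regex as the original rule. The layout is illustrative only and is not what was actually committed to any sandbox. A rule name starting with a double underscore is a subrule: it is evaluated but contributes no score of its own.

```
# Subrule: evaluated and visible to masscheck, but not scored by itself
rawbody   __HTML_NONSENSE_TAGS  /(?:<[A-Za-z0-9]{4,}>\s*){10,}/

# Added later, only if the subrule shows good hits and a high S/O:
# a meta rule that exposes the subrule for scoring
meta      HTML_NONSENSE_TAGS    __HTML_NONSENSE_TAGS
describe  HTML_NONSENSE_TAGS    Many consecutive multi-letter HTML tags, likely nonsense/spam
score     HTML_NONSENSE_TAGS    0.001
```

If the subrule hits a lot but its S/O is low, the meta expression could instead combine __HTML_NONSENSE_TAGS with other rules, which is the analysis step described above.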
I think that's overkill. I've already added it to masscheck with a
score ceiling of 2 to see how it does, so we can get some feedback
and adjust.
Otherwise, my POV is that masscheck is designed for just that purpose:
to check whether the rule has enough merit to be published.
I could be persuaded to use a lower score ceiling or a nopublish flag
if you insist, though.
Regards,
KAM
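For reference, a hedged sketch of what the nopublish option mentioned above might look like in a sandbox rule file. It assumes the rule is kept under its original name; the exact form used in the sandbox is not shown in this thread.

```
rawbody   HTML_NONSENSE_TAGS  /(?:<[A-Za-z0-9]{4,}>\s*){10,}/
describe  HTML_NONSENSE_TAGS  Many consecutive multi-letter HTML tags, likely nonsense/spam
# nopublish: the rule still runs in masscheck but is kept out of published rule updates
tflags    HTML_NONSENSE_TAGS  nopublish
```

This keeps the masscheck feedback while ensuring the rule is not promoted automatically.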