On 1/30/2014 1:25 PM, John Hardin wrote:
On Thu, 30 Jan 2014, Amir Caspi wrote:

On Jan 30, 2014, at 10:28 AM, Kevin A. McGrail <kmcgr...@pccc.com> wrote:

If you want to share the complete rule, I can throw it into my sandbox and see what masscheck thinks as well.

The complete rule would be something like this, assuming Andy implemented it as I wrote it:

rawbody HTML_NONSENSE_TAGS /(?:<[A-Za-z0-9]{4,}>\s*){10,}/
describe HTML_NONSENSE_TAGS Many consecutive multi-letter HTML tags, likely nonsense/spam
score HTML_NONSENSE_TAGS    0.001

Score to be adjusted as needed, of course.

I'd suggest writing it as a subrule first, to see how well it performs against the masscheck corpora. If it does well by itself (good hits, high S/O), then a meta can be added to expose it for scoring. If it hits a lot but the S/O ratio is low, then it could be analyzed for possible combinations with other rules to get something that performs well.
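A minimal sketch of that subrule-first approach, assuming the regex above is kept unchanged. The double-underscore prefix marks a subrule that is evaluated but never scored on its own. The meta lines are left commented out until the masscheck results justify exposing the rule, and the 0.001 score is just the placeholder from above, not a recommendation.

```
# Step 1: subrule only, so masscheck reports hits and S/O without it affecting scores
rawbody   __HTML_NONSENSE_TAGS  /(?:<[A-Za-z0-9]{4,}>\s*){10,}/

# Step 2 (only if the subrule performs well): expose it for scoring through a meta
# meta      HTML_NONSENSE_TAGS  __HTML_NONSENSE_TAGS
# describe  HTML_NONSENSE_TAGS  Many consecutive multi-letter HTML tags, likely nonsense/spam
# score     HTML_NONSENSE_TAGS  0.001
```

If the subrule hits often but with a low S/O, the same meta line is where it could be combined with other rules.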

I think that's overkill. I've already added it to masscheck, with a ceiling of 2 on the score, to see how it does so we can get some feedback and adjust.

Otherwise, my POV is that masscheck is designed for exactly that purpose: checking whether a rule has enough merit to move to publication.

I could be persuaded to use a lower score ceiling or a nopublish flag if you insist, though.
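For reference, a sketch of what the nopublish option would look like, assuming the rule keeps the name HTML_NONSENSE_TAGS. The flag keeps the rule in masscheck while holding it back from the published rule updates.

```
# Keep the rule in masscheck but do not publish it
tflags    HTML_NONSENSE_TAGS  nopublish
```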

Regards,
KAM
