Re: Bayes underperforming, HTML entities?

John Hardin Fri, 09 Nov 2018 07:49:37 -0800

On Fri, 9 Nov 2018, RW wrote:

On Thu, 8 Nov 2018 19:24:47 -0700
Amir Caspi wrote:

On Nov 8, 2018, at 4:51 PM, RW <rwmailli...@googlemail.com> wrote:


Unnecessary encoding is fairly common, but long runs of ASCII
characters encoded like this seem extreme.


Right, that was a question I had asked in my email this morning...
whether we have a rule to detect long sequences of HTML entities.



I was really referring to the fact that it's pure ASCII text that's
being encoded rather than long runs per se, so I'm trying:

rawbody   HTML_ENC_ASCII   /(?:&\#(?:(?:\d{1,2}|1[01]\d|12[0-7])|x[0-7][0-9a-f])\s*;\s*){10}/i
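
For anyone who wants to poke at the pattern outside SpamAssassin, here is a minimal Python sketch. It assumes Python's `re` module treats this pattern the same way Perl does, with `re.IGNORECASE` standing in for `/i`. The helper `encode_decimal` and the sample strings are made up for illustration and are not part of the rule.

```python
import re

# RW's pattern, transcribed from the rule above; re.IGNORECASE stands in for /i.
HTML_ENC_ASCII = re.compile(
    r"(?:&\#(?:(?:\d{1,2}|1[01]\d|12[0-7])|x[0-7][0-9a-f])\s*;\s*){10}",
    re.IGNORECASE,
)


def encode_decimal(text):
    """Hypothetical helper: encode every character as a decimal entity."""
    return "".join(f"&#{ord(ch)};" for ch in text)


ascii_run = encode_decimal("Hello there")   # 11 entities, all code points <= 127
non_ascii_run = encode_decimal("é" * 11)    # 11 entities, all code point 233

print(HTML_ENC_ASCII.search(ascii_run) is not None)      # True
print(HTML_ENC_ASCII.search(non_ascii_run) is not None)  # False
```

The second case is the point of the rule: a run of entities above code point 127 is not matched, so only needlessly encoded ASCII counts toward the ten.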


I'll add that too so that we can compare the results.

but you may well be right that long runs are inherently suspicious; I'm
not very familiar with HTML practices.

Proposed rule:
body        AC_HTML_ENTITY_BONANZA  (?:&(?:[A-Za-z0-9]{2,}|#(?:[0-9]{2,5}|x[0-9A-F]{2,4}));\s*){20}
describe    AC_HTML_ENTITY_BONANZA  Long run of HTML-encoded characters
score       AC_HTML_ENTITY_BONANZA
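
A rough way to see what the proposed pattern does and does not match is the Python sketch below. It assumes the pattern exactly as it appears in this message, with no regex flags, and that Python's `re` handles it the same way Perl would. The function name `hits` and the sample strings are invented for illustration.

```python
import re

# Amir's pattern, copied as it appears above (no flags applied).
ENTITY_BONANZA = re.compile(
    r"(?:&(?:[A-Za-z0-9]{2,}|#(?:[0-9]{2,5}|x[0-9A-F]{2,4}));\s*){20}"
)


def hits(text):
    """Return True if the text contains a run of 20 consecutive entities."""
    return ENTITY_BONANZA.search(text) is not None


long_run = "&#233; &nbsp;" * 10                        # 20 entities in a row
short_run = "&#233; &nbsp;" * 9 + "&#233;"             # only 19 in a row
prose = "Fish &amp; chips, salt &amp; vinegar. " * 20  # entities never adjacent

print(hits(long_run))   # True
print(hits(short_run))  # False
print(hits(prose))      # False
```

Unlike HTML_ENC_ASCII, this pattern accepts named entities and code points above 127, so it measures run length alone and ignores what is being encoded.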



Early results (not all corpora are in yet) look *very* promising:

https://ruleqa.spamassassin.org/20181109-r1846219-n/__AC_HTML_ENTITY_BONANZA/detail

3% of spam, S/O .958 and almost all spam hits are <5 points.

I think we have a winner. Thanks, Amir (and possibly RW)!


--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
  Activist: Someone who gets involved.
  Unregistered Lobbyist: Someone who gets involved
       with something the MSM doesn't approve of.         -- WizardPC
-----------------------------------------------------------------------
 2 days until Veterans Day
