1) Use Martin Blapp's OCR plugin/patch for SA, and feed the extracted text to Bayes.
  http://antispam.imp.ch/patches/patch-ocrtext

2) To combat the "images with subtle differences", develop a checksum method that ignores the low bits (3 or 4 out of 8?) of each color channel. That way you get what is essentially a coarsely quantized image, washing out the subtle variations. Crop that down to remove all white border area, checksum it, and compare the result to a database of known spam images that have been similarly altered. (Which would then suggest someone developing a razor-like database of image checksums; it'd be nice if the return was a confidence percentage.)

(If the alteration leaves an image that is 0x0 pixels, because it became all white, or one that is all one color, it might be worth flagging it with a decent confidence percentage: it was composed entirely of subtle variations from a base color, which I would find suspicious.)
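
A rough sketch of idea 2 in Python, just to make it concrete. Assumptions that are mine and not part of any existing tool: the image is already decoded into rows of (r, g, b) tuples, the border is white, 4 bits are dropped by default, SHA-1 is the checksum, and the function names are made up for illustration.

```python
import hashlib

def quantize(pixels, drop_bits=4):
    """Zero the low `drop_bits` bits of every 8-bit color channel."""
    mask = (0xFF << drop_bits) & 0xFF
    return [[tuple(c & mask for c in px) for px in row] for row in pixels]

def crop_border(pixels, border):
    """Trim outer rows and columns made up only of `border` pixels."""
    rows = [y for y, row in enumerate(pixels) if any(px != border for px in row)]
    if not rows:
        return []  # everything matched the border color: a 0x0 result
    width = len(pixels[0])
    cols = [x for x in range(width) if any(row[x] != border for row in pixels)]
    top, bottom, left, right = rows[0], rows[-1], cols[0], cols[-1]
    return [row[left:right + 1] for row in pixels[top:bottom + 1]]

def fuzzy_checksum(pixels, drop_bits=4):
    """Return (hex digest, suspicious flag) for a decoded RGB image."""
    quantized = quantize(pixels, drop_bits)
    # Pure white (255) after masking; near-white pixels collapse to this too.
    white = ((0xFF << drop_bits) & 0xFF,) * 3
    cropped = crop_border(quantized, white)
    colors = {px for row in cropped for px in row}
    # Nothing left, or one flat color: the image was only subtle variations.
    suspicious = len(colors) <= 1
    digest = hashlib.sha1()
    height = len(cropped)
    width = len(cropped[0]) if cropped else 0
    digest.update(f"{width}x{height}:".encode())  # keep dimensions in the hash
    for row in cropped:
        for px in row:
            digest.update(bytes(px))
    return digest.hexdigest(), suspicious
```

The digest would be the lookup key for the shared database, and the flag covers the all-white / single-color case. Note that an exact checksum like this only survives per-pixel noise within a quantization bucket; a pixel sitting near a bucket boundary, or an image that is shifted or rescaled, would still produce a different digest.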
