[Alan Arndt]
> Over the past month or more I have noticed a large increase in the amount of
> spam I receive with the Spam text translated into images.  The actual text
> of the message is benign gibberish designed to pass Bayesian filters.  They
> have even taken the step of inserting random bits into the image so that no
> two images have the same signature.  I've received many multiple messages
> with the same fundamental image.

Yup, and they're learning to avoid other stupid mistakes too; e.g.,
the gibberish /changes/ from one message to the next, and so does the
forged sender address.  While randomization isn't new in spam, most
spammers have traditionally done a poor job on it.  For example, for a
long time it was very effective to train on the gibberish, since
multiple spammers appeared to use randomization software that produced
the /same/ gibberish time after time.  Likewise they tended to forge
the same sender addresses repeatedly.  Most spam still does, for that
matter.  But some spammers have gotten much smarter.
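
To make the "no two images have the same signature" point concrete, here is
a minimal sketch of why exact-match signatures break. It assumes the
signature is a plain SHA-256 digest of the image bytes, and it fakes the
spammer's trick by appending a few seeded random bytes to a stand-in payload
(real spam perturbs pixels inside the image instead). The names signature
and with_random_tail are made up for illustration; none of this is SpamBayes
code.

```python
import hashlib
import random


def signature(data: bytes) -> str:
    """Exact-match signature: a SHA-256 digest of the raw bytes."""
    return hashlib.sha256(data).hexdigest()


def with_random_tail(data: bytes, seed: int) -> bytes:
    """Hypothetical stand-in for the spammer's randomizer: append 8 random bytes."""
    rng = random.Random(seed)
    return data + bytes(rng.randrange(256) for _ in range(8))


# Stand-in payload, not a real image; only the bytes matter for hashing.
base_image = b"GIF89a" + bytes(range(256))

copy_a = with_random_tail(base_image, seed=1)
copy_b = with_random_tail(base_image, seed=2)

# Same fundamental image, different trailing noise.
signatures = {signature(copy_a), signature(copy_b)}
print(len(signatures))
```

A handful of changed bytes is enough to give each copy its own digest, so a
blocklist of image hashes never sees the same entry twice.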

> I haven't thought of a decent way to filter these types of things.

Me neither.  They're never false negatives for me, but I reliably get
a few unsures every day from what appears to be the same pump-and-dump
scam-spam source (these are messages hard-selling specific penny
stocks -- the scammer hopes to drive up the market price ("pump") by
stimulating demand, and then sell quick at a profit ("dump")).

It's very much in the spirit of SpamBayes to generate tokens for what
the user /sees/, but in these cases we have no idea what the user sees
(except for the gibberish text).
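
A rough sketch of the problem, using only the standard library's email
package: walk the MIME parts, split the text parts on whitespace, and note
that the image part contributes nothing readable. The function name
visible_text_tokens, the sample gibberish, and the fake image bytes are all
invented for illustration; the real SpamBayes tokenizer does a great deal
more than this.

```python
from email.message import EmailMessage


def visible_text_tokens(msg):
    """Return (tokens from text/* parts, count of image parts we can't read)."""
    tokens = []
    opaque_images = 0
    for part in msg.walk():
        maintype = part.get_content_maintype()
        if maintype == "text":
            # Only the text parts yield words we can tokenize.
            tokens.extend(part.get_content().split())
        elif maintype == "image":
            # Without OCR, the image is just bytes; no words come out of it.
            opaque_images += 1
    return tokens, opaque_images


# Build a toy message: benign gibberish text plus an image attachment.
msg = EmailMessage()
msg.set_content("gazebo plinth marmalade quorum")
msg.add_attachment(
    b"GIF89a-stand-in-bytes",  # not a real image
    maintype="image",
    subtype="gif",
    filename="offer.gif",
)

tokens, opaque_images = visible_text_tokens(msg)
print(tokens, opaque_images)
```

The only tokens available are the gibberish words, which is exactly what the
spammer intended; the sales pitch the user actually sees stays locked inside
the image part.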

BTW, it's typical of pump-and-dump scams that they're not trying to
extract money /directly/ from you (they're trying to get you to buy a
stock on the open market), so we don't even get a URL or mailing
address to tokenize.

>  I hope someone else can and that it can get implemented into SpamBayes.

It's discussed here (maybe more so on spambayes-dev, the related
developers' mailing list) regularly, but AFAICT extracting readable
text from images is a complicated and expensive job.  If someone finds
a programmatic way to do it cheaply and with reasonable accuracy, I'm
sure SB could make excellent use of it.
_______________________________________________
[email protected]
http://mail.python.org/mailman/listinfo/spambayes
Check the FAQ before asking: http://spambayes.sf.net/faq.html
