> If, after excluding black, we find that 100% of the color map is that
> nasty pastel pink or pastel lime green (etc) then it's a spam and we
> toss it.
>
> Sound reasonable?

I was thinking about this the other day.  I think the concept is reasonable,
but as stated it doesn't go far enough and would be trivial to bypass.

I think that someone first needs to come up with either a formula or a list
of RGB triples that are "visually indistinguishable" or some such.  (I
suspect this has been done several times now and the research should exist
in the wild.)

This can then be used as a fuzz factor to collapse colors that are very
close together into a common bucket.  Without it, trivial 1-bit variations
on a color would defeat the simple scheme.
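
As a rough sketch of the bucketing idea (not a real perceptual model), the
Python below simply drops the low bits of each channel so that near-identical
colors land in the same bucket, then reports what fraction of the non-black
pixels fall in the single largest bucket.  The function names, the 4-bit
default and the "pixels as a list of (r, g, b) tuples" input are all my own
assumptions.  Note that bit-dropping is a crude stand-in for "visually
indistinguishable": two close colors that straddle a bucket boundary still
end up apart, and near-black is thrown out along with pure black.

```python
from collections import Counter

def bucket(rgb, bits=4):
    """Keep only the top `bits` bits of each 8-bit channel."""
    shift = 8 - bits
    r, g, b = rgb
    return (r >> shift, g >> shift, b >> shift)

def dominant_fraction(pixels, bits=4, ignore=(0, 0, 0)):
    """Fraction of pixels in the largest color bucket, after
    discarding the bucket that contains `ignore` (black by default)."""
    counts = Counter(bucket(p, bits) for p in pixels)
    counts.pop(bucket(ignore, bits), None)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return max(counts.values()) / total

# Two pinks that differ by one step per channel, plus black text.
pixels = [(255, 192, 203)] * 50 + [(254, 193, 202)] * 50 + [(0, 0, 0)] * 100
print(dominant_fraction(pixels))
```

With exact matching the two pinks above would each count for 50% of the
non-black pixels; with the fuzz they count as one color.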

It might also be interesting to accumulate a) total area of each color and
b) largest rectangle (or other easily detected shape) of each color.  The
first we would already have from the pixel counts.  The second could be
used to detect large areas of fill color.  This might help distinguish a
text message from a map of the world or a picture of downtown Cameroon.
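
To make the two measurements concrete, here is a minimal sketch, assuming
the image has already been reduced to a grid (list of rows) of color labels.
The function names and the grid representation are mine.  The per-color area
is just a count; for the largest rectangle I have used the standard
"largest rectangle in a histogram" stack trick applied row by row, which is
one way to do it and not necessarily the cheapest for this job.

```python
from collections import Counter

def color_areas(grid):
    """Total pixel count of each color label in the grid."""
    return Counter(c for row in grid for c in row)

def largest_rectangle(grid, color):
    """Area of the largest axis-aligned rectangle made only of `color`."""
    if not grid:
        return 0
    width = len(grid[0])
    heights = [0] * width   # run of `color` ending at the current row
    best = 0
    for row in grid:
        for x in range(width):
            heights[x] = heights[x] + 1 if row[x] == color else 0
        stack = []          # column indexes with increasing heights
        for x in range(width + 1):
            h = heights[x] if x < width else 0   # sentinel flushes the stack
            while stack and heights[stack[-1]] >= h:
                top = stack.pop()
                left = stack[-1] + 1 if stack else 0
                best = max(best, heights[top] * (x - left))
            stack.append(x)
    return best

grid = [
    "PPPK",
    "PPPK",
    "KPPP",
]
print(color_areas(grid)["P"], largest_rectangle(grid, "P"))
```

A large ratio of largest rectangle to total area for one color would suggest
a flat fill; text on a background tends to break the fill into small pieces.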

It also might be interesting to accumulate statistics on the common color
distributions for 10K or so legit images sent through email, possibly along
with some sort of indication of purpose: "picture of me", "picture of my
dog", "billboard I saw", "kids at Christmas", "Hallmark greeting card", etc.

With that info the color distribution might be able to help classify the
image fairly cheaply.
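
One cheap way to use such statistics, sketched below, is to reduce each
image to a coarse normalized color histogram and pick the stored profile it
overlaps most (histogram intersection).  This is only an illustration of
the idea: the profile names and the toy pixel lists used to build them are
invented for the example and stand in for statistics that would have to be
gathered from real mail.

```python
from collections import Counter

def histogram(pixels, bits=2):
    """Normalized histogram over coarse RGB buckets (top `bits` bits)."""
    shift = 8 - bits
    counts = Counter((r >> shift, g >> shift, b >> shift) for r, g, b in pixels)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()} if total else {}

def intersection(h1, h2):
    """Overlap of two normalized histograms: 0.0 (disjoint) to 1.0 (same)."""
    return sum(min(v, h2.get(k, 0.0)) for k, v in h1.items())

def classify(pixels, profiles, bits=2):
    """Name of the profile whose histogram best overlaps the image's."""
    h = histogram(pixels, bits)
    return max(profiles, key=lambda name: intersection(h, profiles[name]))

# Toy, made-up profiles: a flat pastel fill with a little black text,
# and an even spread of colors standing in for a photograph.
steps = (0, 64, 128, 192)
profiles = {
    "flat_fill": histogram([(255, 192, 203)] * 90 + [(0, 0, 0)] * 10),
    "photo": histogram([(r, g, b) for r in steps for g in steps for b in steps]),
}
sample = [(250, 200, 210)] * 80 + [(0, 0, 0)] * 20
print(classify(sample, profiles))
```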

I don't know how much of the above would be absolutely necessary, but I
suspect at least some of it is.  Still, this is a fairly trivial sort of
thing to have to accumulate.  Especially since all image spam (at least
currently) uses GIFs, which a blind man can decode with a hair comb - no
fancy software
required.

        Loren
