Re: Filtering zip spam

Chip M. Wed, 28 Apr 2010 12:52:49 -0700

>I'm seeing an increase in zip attachment spam, and hoped someone
>could help me figure out why it isn't being properly tagged. Are
>others seeing this? Is BAYES_99 being triggered or is it lower?


Alex, does Bayes understand/check INSIDE zips, at least for file
properties?  If not, then it is inherently limited (just in this
context), which is a big part of why this is such an effective
technique.  Adding that to Bayes should be relatively
straightforward, and should make zips less attractive to spammers.


>The score is very low. Does someone have an idea of other
>characteristics that I can flag on?

One simple approach is to score all "small" zips, then meta that
with other characteristics, like ANY blocklist hit, "unusual"
nation of origin, etc.

That's safer than outright blocking merely "unusual" nations, like
France. :)

That's how I first handled zips, a few years ago, and it's fairly
effective.  Small zips in ham are VERY unusual, and typically are
sent by more sophisticated users, so it may be viable to have a
Subject-based "skip" rule (again, via metas) that would cancel out
other tests.

To avoid FPs, I'm using the RealName-based rules I described almost
three years ago (I have several "skip" rules daisy-chained off
those - a good example of an anti-spam mechanism which turned into
a very effective anti-FP mechanism).
Note that all the current zips have incorrect RealNames.


Alex, as with all rules, it really depends on your ham ecology.
Feel free to share more info about yours (we need the equivalent
of the Geek Code for ham ecology!).  When you first started
posting, I briefly assumed you were a college student, then
gradually realized you have decent volume and diversity. :)


All of the recent zipped file campaigns look like the work of last
year's inline-PNG/RTF coder, so we could well be in for more
variants.

Using zips is an interesting delivery mechanism.  Most Windows
versions have easy means to open them, and there's an element of
novelty (even I was almost excited when the first zipped JPEG
arrived - followed by disappointment that it was merely a
"standard" wavy pharm).


Another approach I had been using was a (post-SA) test that
extracts all filenames, and just looks for any specified file
extension(s).

It worked, but that test was designed for malware detection, and
has VERY limited options.  There was no means of restricting it to
a zip containing just one small RTF and no other files, so my
initial rule would have mis-fired on anything with a mix of files.
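
For illustration only, here is a minimal Python sketch of the kind
of rule that the old test could not express: "exactly one file, it
is an RTF, and it is small".  This is not the Object Pascal filter
described in this post; the function name and the 20,000-byte
threshold are hypothetical.

```python
import io
import zipfile

MAX_RTF_BYTES = 20_000  # hypothetical threshold, tune against your own ham


def is_lone_small_rtf(data: bytes, max_bytes: int = MAX_RTF_BYTES) -> bool:
    """True if the zip holds exactly one file, an RTF no larger than max_bytes."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            # Ignore directory entries; only real files count.
            entries = [i for i in zf.infolist() if not i.is_dir()]
    except zipfile.BadZipFile:
        return False  # not a readable zip, so this rule does not fire
    if len(entries) != 1:
        return False  # a mix of files must not trigger the rule
    only = entries[0]
    # file_size is the uncompressed size declared in the zip directory.
    return only.filename.lower().endswith(".rtf") and only.file_size <= max_bytes
```

A zip with an RTF plus any other file returns False, which is the
mis-fire case described above.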

I finally had my Kaylee Frye moment about two weeks ago, and wrote
a brand new "Zip Info" module, similar to "Image Info", for my
post-SA filter (sorry, written in Object Pascal).

I designed it to expose far more info, and wrote the rules module
so I'd have far more control than was currently "necessary".

As I mentioned in a post in January, I had noticed a consistent
value in an Image properties field which I was calculating, but
not (at the time) exporting.
I'm trying to avoid that mental kick moment. :)


SANITY CHECK please!
Here's what I'm currently exporting:

Entire zip:
    - number of files
    - compression ratio (i.e. across ALL files)

Per file:
    - filename
    - compression ratio
    - file date
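
As a rough sketch of what exporting those properties could look
like, here is a Python version using the standard zipfile module.
It is not the Object Pascal module itself.  The function names are
hypothetical, and "compression ratio" is assumed to mean compressed
size divided by uncompressed size.

```python
import io
import zipfile
from datetime import datetime


def ratio(compressed: int, uncompressed: int) -> float:
    """Assumed definition: compressed / uncompressed (1.0 for empty files)."""
    return compressed / uncompressed if uncompressed else 1.0


def entry_date(info: zipfile.ZipInfo):
    """File date from the zip directory, or None if the stored date is invalid."""
    try:
        return datetime(*info.date_time)
    except ValueError:
        return None


def zip_info(data: bytes) -> dict:
    """Collect whole-archive and per-file properties from raw zip bytes."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        entries = [i for i in zf.infolist() if not i.is_dir()]
    files = [
        {
            "filename": i.filename,
            "ratio": ratio(i.compress_size, i.file_size),
            "date": entry_date(i),
        }
        for i in entries
    ]
    return {
        "num_files": len(entries),
        # Ratio across ALL files, not an average of per-file ratios.
        "ratio": ratio(
            sum(i.compress_size for i in entries),
            sum(i.file_size for i in entries),
        ),
        "files": files,
    }
```

Everything here comes from the zip directory, so nothing has to be
decompressed to evaluate rules on these values.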

The only property I'm not currently doing anything with is the
individual file date.  I'm having my endusers log their ham data
for a few weeks, then I'll see if there's anything useful, ham vs
spam wise.  I predict ham will have a rich date range, and spam
will be mostly/entirely recent.  I may add a simple "younger/older
than n days" test, regardless, since when dealing with spammers,
Logic is often NOT the beginning of Wisdom. ;)
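
A "younger than n days" test is simple to sketch.  The Python below
is only an illustration; the function name is hypothetical, and it
assumes the file dates have already been pulled out of the zip as
datetime values.

```python
from datetime import datetime, timedelta


def all_younger_than(dates, days, now=None):
    """True if there is at least one date and every date is within `days` of now."""
    now = now or datetime.now()
    cutoff = now - timedelta(days=days)
    # An empty list never matches, so a zip with no usable dates does not fire.
    return bool(dates) and all(d >= cutoff for d in dates)
```

The "older than" variant is the same comparison reversed.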


Implementing the basic properties extraction was trivial.
Thinking through how I wanted to handle the rules was more of a
challenge. :)

Figured I'd share where I'm at, and pick the big brains. :)
    - "Chip"

P.S.  I am also seriously considering adding the ability to extract
any specified file as a text or binary stream, with the text stream
defaulting to being fed to a domain extraction module.

It's not unreasonable for somebody to send a legit zipped RTF, so
content scanning would be good.  These spam RTFs in particular are
tiny (low overhead to extract) yet intensely spammy.
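
A minimal Python sketch of that idea, assuming a size cap on what
gets extracted and a deliberately simple regex standing in for a
real domain extraction module.  The names, the 64 KB cap, and the
regex are all hypothetical.

```python
import io
import re
import zipfile

# Simplified hostname pattern; a real domain extractor would be stricter.
DOMAIN_RE = re.compile(
    r"\b(?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z]{2,}\b",
    re.IGNORECASE,
)


def extract_domains(data: bytes, member: str, max_bytes: int = 65536) -> set:
    """Extract one named file as text and return the domains found in it.

    Raises KeyError if `member` is not in the archive.
    """
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        info = zf.getinfo(member)
        if info.file_size > max_bytes:
            return set()  # too large to be worth scanning here
        with zf.open(info) as fh:
            # Cap the read as well, in case the declared size is wrong.
            raw = fh.read(max_bytes)
    # latin-1 maps every byte to a character, so decoding never fails.
    text = raw.decode("latin-1")
    return {m.group(0).lower() for m in DOMAIN_RE.finditer(text)}
```

The resulting domains could then be checked against the same
blocklists used for URIs in the message body.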
