Re: Finding URLs in html attachments

Chip M. Tue, 02 Mar 2010 14:20:18 -0800

On Sun, 28 Feb 2010, LuKreme wrote: 
> SPF! 
> 
> <runs; ducking, shucking, and weaving>


You're a brave person. ;)

It's easier to understand the challenge Dave faces, if we look at
some actual From headers.

In my stream, these started in early November of last year, so I
just checked a few months of data from one domain which has had a
steady trickle, AND has the richest ham diversity of all the
domains I filter (translation: highest FP rate, and highest number
of custom pass/skip rules (fortunately, also one of my MOST keen
and helpful domain admins)).

Since these started, they've had 19 of these phish:
  1 "Bank of America"<supp...@boa.com>
  1 "PayPaI"<upd...@paypai.com>
  1 "Paypal Inc."<cust_s...@paypalsecurity.com>
  1 "serv...@irs.gov"<serv...@irs.gov>
  1 "serv...@paypal.com"<c>
  1 "serv...@paypal.com"<secur...@act.embarqservices.net>
  3 "serv...@paypal.com"<Security>
  1 "U.S. Bancorp"<off...@usb.com>
  1 "Wachovia"<supp...@wachovia.com>
  1 "Wells Fargo Online"<ofsreponline.al...@wellsfargo.com>
  1 Bank of America <memberserv...@bofa.com>
  2 Bank of America <serv...@boa.com>
  1 Bank of America<memberserv...@boa.com>
  1 Internal Revenue Service<service.refun...@irss.com>
  1 Western Union<memberserv...@poste.it>
  1 Western Union<memberserv...@wumts.com>

(first column is frequency)

This was from a sample size of:
  106171 spams
   43692 hams

Note the variations on Paypal, none of which would trigger an SPF
issue (some did have matching SMTP Senders).  Note the clever use
of RealNames to mask the actual From domain.
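
For anyone who wants to test for the RealName trick, here is a minimal Python sketch. It is not an SA rule and not what my own filter does. It assumes the tell is an address-like string in the display name whose domain differs from the domain of the actual From address. The function name and the regex are made up for illustration.

```python
import re
from email.utils import parseaddr

# Matches something that looks like an email address and captures its domain.
ADDR_IN_NAME = re.compile(r"[\w.+-]+@([\w-]+(?:\.[\w-]+)+)")

def realname_masks_domain(from_header):
    """Return True if the display name shows an address whose domain
    differs from the domain of the real From address (hypothetical check)."""
    realname, addr = parseaddr(from_header)
    match = ADDR_IN_NAME.search(realname)
    if not match:
        # Plain display names such as "Wachovia" are not covered here.
        return False
    claimed = match.group(1).lower()
    # If the real address has no "@", the whole token is compared,
    # so a bare <Security> style address also counts as a mismatch.
    actual = addr.rpartition("@")[2].lower()
    return claimed != actual
```

This only covers the variants where the RealName itself looks like an address. The lookalike domains and plain brand names in the list above would need other tests.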

By spam standards, these are VERY well crafted.


Note that ALL hit my phish tests, as outlined last week. :)


In that same sample, I found only 3 hams with base64
application/octet-stream html attachments.  Given their ham
diversity, that was most promising.

The hams were:
  jcpenney.com
  (they're already part of our manually maintained "bulk" nations,
   with an implicit set of skip conditions)

  a local church
  (one html attachment was in amongst a ton of other stuff
   (mostly Word docs), all domains were already skip listed,
   and the sender already had a modest pass rule)

  "Britannica Elementary Encyclopedia article"
  (had _LOTS_ of other issues (including INVALID_DATE), and
   FP'd quite spectacularly!)

When these phish first appeared, I did a similar ham check
(further back, more domains), and found no major issues, so I
ended up adding a base64 html attachment content rule.
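
As a rough illustration of that kind of ham check, the sketch below walks a message and reports base64 application/octet-stream parts that look like html. It is a sketch under assumptions, not my actual rule: deciding "html" by a .htm/.html filename or by an "<html" marker in the first 1024 decoded bytes is my guess at a workable heuristic, and the function name is hypothetical.

```python
import email
from email import policy

HTML_SUFFIXES = (".htm", ".html")

def base64_html_attachments(raw_bytes):
    """Return the MIME parts that are base64 application/octet-stream
    and appear to contain html (heuristic, see assumptions above)."""
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    hits = []
    for part in msg.walk():
        if part.get_content_type() != "application/octet-stream":
            continue
        cte = str(part.get("Content-Transfer-Encoding", "")).strip().lower()
        if cte != "base64":
            continue
        filename = (part.get_filename() or "").lower()
        payload = part.get_payload(decode=True) or b""
        # Assumed sniffing window: only the first 1024 decoded bytes.
        looks_html = b"<html" in payload[:1024].lower()
        if filename.endswith(HTML_SUFFIXES) or looks_html:
            hits.append(part)
    return hits
```

Run over a ham corpus, the count of messages with a non-empty result gives the kind of number quoted above.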

Dave, I do have one university professor research domain (and it
was one of the corpora I ham checked), however it's in the social
sciences, so it's probably a significantly different ham ecology
from what you're seeing.

I have a strong impression that you're a data-analyzing kind of
guy :) and probably have decent logs.  Do you see many ham base64
html attachments?

That's more my curiosity, than anything.  Those just feel like the
sort of thing it's legitimate to penalize, though of course it
depends on your FP pipeline tools, and user community.
I've supported PhDs, and find quilting grannies far easier. ;)


In my own post-SA filter, I've been extracting URLs from these,
for years.  In most cases the domains were VERY useful and did
trigger on some blocklists.  It's definitely the more technically
correct approach.  I still use the kludge content rule, mainly for
belts-and-suspenders, since these ARE well crafted.

Given the low rate of occurrence in ham, I didn't anticipate any
significant extraction performance issues, though I do have a size
constraint in place.  If that's the concern, these have all been
small-ish.
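
A minimal sketch of extraction with a size constraint, in Python rather than Perl, might look like the following. It is not my filter's code. The 256 KB cap, the choice of href/src/action attributes, and all names are assumptions for illustration. The input is the already decoded attachment body.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit

MAX_HTML_BYTES = 256 * 1024  # hypothetical cap, tune to taste

class _LinkCollector(HTMLParser):
    """Collects raw URL values from a few URL-bearing attributes."""
    URL_ATTRS = {"href", "src", "action"}

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in self.URL_ATTRS and value:
                self.urls.append(value)

def extract_hosts(html_bytes, max_bytes=MAX_HTML_BYTES):
    """Return the set of hostnames referenced by the html, or an empty
    set if the attachment exceeds the size cap."""
    if len(html_bytes) > max_bytes:
        return set()
    collector = _LinkCollector()
    collector.feed(html_bytes.decode("utf-8", errors="replace"))
    collector.close()
    hosts = set()
    for url in collector.urls:
        try:
            host = urlsplit(url).hostname
        except ValueError:
            continue  # malformed URL, skip it
        if host:
            hosts.add(host.lower())
    return hosts
```

The resulting hostnames are what would then be checked against blocklists. Form "action" targets are included because that is where a phish form would post.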

John mentioned the reasoning for SA not extracting was:
    "if the MUA doesn't display it automatically, why should we scan it?"
That makes perfect sense as a general principle.  However, in the
case of these phish, social engineering is the vector for their
display: the recipient is talked into opening the attachment.

Apologies if I'm missing blatant Perl or SA architecture issues,
about which, I am only an egg.
        - "Chip"
