Phish - two simple techniques that make the "obvious" tests viable

Chip M. Wed, 24 Feb 2010 02:12:04 -0800

Every few months, someone suggests detecting phish by looking for a
different domain in the target vs display URL in HTML links.


Other suggestions have included testing for different domain in the
SMTP envelope Sender and the hostname of the sending IP.

Every time, the grizzled veterans patiently explain that these are
completely logical approaches, but enough idiot mailers follow
these patterns that they're not viable tests.

There ARE two simple techniques that, when COMBINED, make these
"obvious" tests viable:
 1. only test emails with a "phishy token" in a key header
 2. use domain "skip" lists to exclude the idiots

Instead of running these tests on ALL email, just test stuff which
has a "phishy token" (e.g. "ebay", "citibank", "wamu") in the
domain of either the SMTP Sender or From.  If you're willing to
risk a higher FP rate, you can also check the RealName part of
From, or even the Subject header.
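
For illustration, the token gate could be sketched roughly like this.
This is a minimal sketch, not the code behind the post: the names
PHISH_TOKENS, domain_of and has_phishy_token are made up here, and
plain substring matching on the domain is an assumption.

```python
# Hypothetical sketch of the "phishy token" gate described above.
# Token examples come from the post; everything else is assumed.
PHISH_TOKENS = {"ebay", "citibank", "wamu"}


def domain_of(address):
    """Return the lowercased domain part of an email address, or ''."""
    if "@" not in address:
        return ""
    return address.rpartition("@")[2].strip("<> ").lower()


def has_phishy_token(envelope_sender, from_addr, tokens=PHISH_TOKENS):
    """True if any token appears in the SMTP Sender or From domain."""
    domains = (domain_of(envelope_sender), domain_of(from_addr))
    return any(tok in dom for dom in domains for tok in tokens)


# Only messages that pass the gate go on to the phish tests.
if has_phishy_token("bounce@mail.example.net", "service@ebay-security.example"):
    pass  # run the phish tests here
```

Checking the RealName or Subject as well would just mean passing more
strings through the same substring check, at the cost of more FPs.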

Domain skip lists handle the classic case of CitiBank, whose
legitimate mail mixes several different domains.
Just skip list all their "problem" domains, and implement your
tests so they don't fire if BOTH domains are skip listed.
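
A rough sketch of the "don't fire if BOTH domains are skip listed"
rule follows. The domains are placeholders, and base_domain is a naive
assumption (last two labels only, so it gets things like co.uk wrong).

```python
# Hypothetical sketch of a diff-domain test with a skip list.
# The skip-list entries are placeholders, not real data.
SKIP_DOMAINS = {"bank.example", "bank-mail.example"}


def base_domain(host):
    """Naive base domain: the last two labels of a hostname."""
    labels = host.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def diff_domain_hit(domain_a, domain_b, skip=SKIP_DOMAINS):
    """Fire when the two domains differ, unless BOTH are skip listed."""
    a, b = base_domain(domain_a), base_domain(domain_b)
    if a == b:
        return False
    if a in skip and b in skip:
        return False
    return True
```

The same function works for any pair: display vs target URL, SMTP
Sender vs From, or Sender vs the sending host's name.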

The beauty of this approach is that (after the initial coding) it's
100% data-driven, can be tailored to your aggression comfort level,
is extremely fast/low-overhead, and is EASY to maintain.

Plus, it works.
I've been doing this for almost five years. :)



Example:
A few weeks ago, there was an explosion of Facebook phish.

I immediately added "facebook" to my list of phish tokens, loaded
up a few months worth of Facebook ham, dumped all the sender
hostname domains, added those to my global domain skip list, dumped
a list of all unique domains that had diffs between target and
display URLs, skip listed those, ran a selective MassCheck against
the Facebook ham, then deployed.
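
The skip-list mining steps above might look something like this
sketch. The message layout (a dict with "sender_host" and a "links"
list of display/target URL pairs) is purely an assumption made for the
example, as is the naive base_domain helper.

```python
# Hypothetical sketch: mine skip-list candidates from known-good mail.
from urllib.parse import urlsplit


def base_domain(host):
    """Naive base domain: the last two labels of a hostname."""
    labels = host.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def mine_skip_domains(ham_messages):
    """Collect sender host domains, plus domains with display/target diffs."""
    skip = set()
    for msg in ham_messages:
        skip.add(base_domain(msg["sender_host"]))
        for display_url, target_url in msg["links"]:
            d = base_domain(urlsplit(display_url).hostname or "")
            t = base_domain(urlsplit(target_url).hostname or "")
            if d and t and d != t:
                skip.update((d, t))
    return skip
```

The result should only be merged into a live skip list after a check
against the same ham, which is what the selective MassCheck step does.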

The data loading step took longer than all other steps combined. :)

At the same time, I also checked MySpace, and made all necessary
changes, in case that became targeted (yes, it did).



Other Tests:
This approach allows several other simple phish tests.  Currently
I've got 11 small tests, some of them sane+sensible, some of them
uber-aggressive (and only appropriate if you have a good FP
pipeline).

I came up with this approach after digging thru about a hundred
hand selected phish, and noticed that _ALL_ had different domains
in the SMTP Sender and From.  I was about to code that as a test,
when I realized that pattern was common with mailing lists, so I
needed some way to restrict which emails were tested.  Hence, the
phishy tokens tactic.  It was a little later that I added skip
domains, mainly to replace some hard-coded ugly kludges.

I find that rule (diff domain in Sender vs From hdr) is MUCH more
effective than a URL diff domain rule.

Note that an IP-based exception must be made for Paypal (the From
domain is always different for user transactions).

Here are a few other simple rules that work if using this approach:
- phished domain name appears in Param or Sub-host part of URL
  (unless the URL's target domain is on a skip list)
- Raw IP address in URL
- "unusual" Nation in Received IPs (my test takes as a parameter a
  separate list of permitted Nations, which is customized for each
  domain or group of end user accounts)
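
As an illustration, the first two rules could be sketched as below.
Reading "Param" as the URL's query string is an assumption, as are the
function names and the naive base_domain helper. The Nation test is
left out because it needs an IP-to-country data source.

```python
# Hypothetical sketch of two of the URL rules listed above.
import ipaddress
from urllib.parse import urlsplit


def base_domain(host):
    """Naive base domain: the last two labels of a hostname."""
    labels = host.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def url_has_raw_ip(url):
    """True if the URL's host is a literal IP address."""
    host = urlsplit(url).hostname or ""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def token_in_param_or_subhost(url, token, skip=frozenset()):
    """True if token is in the sub-host or query, unless target is skipped."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    target = base_domain(host)
    if target in skip:
        return False
    # Sub-host = whatever precedes the base domain in the hostname.
    subhost = host[: -len(target)] if target and host.endswith(target) else ""
    return token in subhost or token in parts.query.lower()
```

A URL such as http://www.facebook.com.evil.example/login hits the
second rule because "facebook" sits in the sub-host, not in the
target domain.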



Show Me The Numbers!:
Here's some actual stats (current 6 months) for my most diverse
(ham-wise) domain, showing the number of hits for each of the tests
described above (6308 phish hits out of 129604 spams):
    domain-DiffDomains        416
    domain-InParamOrSub      2366
    domain-RawIP               12
    hdr-HostName             4843
    hdr-DiffDomains          3023
    hdr-Nation               5427
NOT all of those were phish, but all did have a phishy token in
a key header.

Here's stats for Jan-2010 for my primary (pure-Geek) domain:
    domain-DiffDomains        144
    domain-InParamOrSub       106
    hdr-HostName              237
    hdr-DiffDomains           242
    hdr-Nation                241
That's for 745 actual phishes (ALL semi-hand-verified, I excluded
Facebook).

Here's 508 Facebook phishes for the same period and domain:
    domain-DiffDomains          0
    domain-InParamOrSub       508
    hdr-HostName              508
    hdr-DiffDomains           508
    hdr-Nation                496

When I hand verified all my phish hits for that data sample,
318 were NOT actual phish.  There were a total of 21955 spam.
The only FPs were 2 from an oft targeted company that we had never
done business with (I should have pre-emptively skip listed their
key domains ages ago).

All tests are listed by group (domain/hdr), then in the order I
described them above.



Advanced Considerations:
About two years ago, I split my phish tokens into two lists:
generic (e.g. "bank") and specific (e.g. "ebay").  That gave me
more flexibility in my matching algorithms that decide whether to
run the phish tests.

The algorithm for "specific" tokens is very simple.  The main
consideration is handling occurrence in the From's RealName
(I added that much later).
 
My "generic" tokens algorithm looks at position within each domain,
and other factors.
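
To make the two-list idea concrete, here is a minimal sketch. The
post does not spell out the "generic" algorithm, so the position rule
used here (a generic token only counts in the second-level label) is
purely an illustrative assumption, as are all the names.

```python
# Hypothetical sketch of a two-list token gate.
# Token examples come from the post; the matching rules are assumed.
SPECIFIC_TOKENS = {"ebay"}
GENERIC_TOKENS = {"bank"}


def should_run_phish_tests(domain):
    """Decide whether a header domain qualifies for the phish tests."""
    dom = domain.lower()
    # Specific tokens: a simple substring match anywhere in the domain.
    if any(tok in dom for tok in SPECIFIC_TOKENS):
        return True
    # Generic tokens: ASSUMED rule, match only the second-level label.
    labels = dom.split(".")
    second_level = labels[-2] if len(labels) >= 2 else dom
    return any(tok in second_level for tok in GENERIC_TOKENS)
```

Keeping the lists separate means the generic rule can be tightened or
loosened without touching the safer specific matches.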

I recommend implementors start with a single simple algorithm, then
play with some data and tweak for effect. :)

As the Facebook stats show, you can achieve a VERY high kill rate
with JUST the simple stuff. :)

In general, specific tokens are MUCH safer.
If you have a good corpus and tools, it's easy to data mine skip
domains, and go with some carefully selected generic tokens,
but you're all but guaranteed hits on non-phish spam.


About a year ago, I spotted a phish that used a zombified home DSL
machine as its target.  Since I already had that provider in my
domain skip list (purely for performance reasons), it did NOT
trigger my domain display-target phish test.
I've since added a SEPARATE skip list whose entries are NOT honored
during phish testing (they are still skipped everywhere else), and
moved all large ISPs onto that list.
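
One way to picture the two lists is the sketch below. The list names
and domains are placeholders, and the single phish_testing flag is an
assumption about how the lookup might be wired up.

```python
# Hypothetical sketch of a skip lookup with a separate ISP list.
GLOBAL_SKIP = {"bank.example"}         # always honored
ISP_SKIP = {"dsl-provider.example"}    # NOT honored during phish tests


def is_skipped(domain, phish_testing):
    """Return True if the domain should be skipped in this context."""
    d = domain.lower()
    if d in GLOBAL_SKIP:
        return True
    return d in ISP_SKIP and not phish_testing
```

With this split, a phish hosted on a zombified DSL machine is still
tested, while other rules keep their performance shortcut.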


A LOT of non-phish spam hit these rules, usually because the
spammer forged a domain that has a financial oriented "generic"
phish token.  I'm ok with the extra kills.  Really, I am. :)
The pedantic part of my brain would be happier if only phish were
killed by these, but this is one of the VERY few times I ignore
that urge.  I hope John Hardin will forgive me. ;)

The more aggressive/generic your phish tokens, the higher your FP
rate (yeah, that's obvious, and is (respectfully) aimed at the
grasshoppers).
I find that most mis-fires are regular senders, so I use lower
phish scores with my new users, identify all their potential
problems, skip list accordingly, then up their scores.
I'm dealing entirely with small to medium domains, and have good
tools, so that makes sense in my environment.  I would expect
large-scale environments to use less aggressive tokens.



Sharing Data:
One thing that would be helpful is if we built up a database of
skip domains for EACH phish target.  I should have been doing this
from the beginning, but I just auto-added them without recording
which domains matched which target.  In the near future, I'll be
doing some data mining to rectify my lapse.

For example:
        fbcdn.net
        tfbnw.net
belong to Facebook.
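
A per-target database could be as simple as the sketch below. The
mapping layout and function names are assumptions; only the two
Facebook domains come from the post.

```python
# Hypothetical sketch of a per-target skip-domain database.
SKIP_BY_TARGET = {
    "facebook": {"fbcdn.net", "tfbnw.net"},
}


def skip_domains_for(token):
    """Skip domains recorded for one phish target (empty set if none)."""
    return SKIP_BY_TARGET.get(token, set())


def global_skip_list():
    """Flatten the per-target data into a single skip list."""
    merged = set()
    for domains in SKIP_BY_TARGET.values():
        merged |= domains
    return merged
```

Recording the target alongside each domain keeps the data shareable:
someone who only cares about one target can pull just that entry.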

I've also been moving the IP ranges of all financial organizations
and financial ESPs into separate "virtual" nations (about once per
month I merge my virtual/manual-override ranges with fresh data
from the RIRs, then redistribute to my user base).

That's particularly useful for non-Americans who use Nation-based
testing (they aren't forced to include all of the USA, when all
they really want is eBay/Paypal/etc).


I hope that's both clear and useful.
I've got a rather bad case of flu, which :) led to me wanting to
hand verify several hundred phish hits, but it may also have
resulted in more obtuse language than usual from me.
        - "Chip"
