Every few months, someone suggests detecting phish by looking for a different domain in the target vs display URL in HTML links.
Other suggestions have included testing for a different domain in the SMTP envelope Sender and the hostname of the sending IP. Every time, the grizzled veterans patiently explain that these are completely logical approaches, however enough idiot mailers follow these patterns that they're not viable tests.

There ARE two simple techniques that, when COMBINED, make these "obvious" tests viable:

1. only test emails with a "phishy token" in a key header
2. use domain "skip" lists to exclude the idiots

Instead of running these tests on ALL email, just test stuff which has a "phishy token" (e.g. "ebay", "citibank", "wamu") in the domain of either the SMTP Sender or From. If you're willing to risk a higher FP rate, you can also check the RealName part of From, or even the Subject header.

Domain skip lists handle the classic case of CitiBank. Just skip list all their "problem" domains, and implement your tests so they don't fire if BOTH domains are skip listed.

The beauty of this approach is that (after the initial coding) it's 100% data-driven, can be tailored to your aggression comfort level, is extremely fast/low-overhead, and is EASY to maintain. Plus, it works. I've been doing this for almost five years. :)

Example:

A few weeks ago, there was an explosion of Facebook phish. I immediately added "facebook" to my list of phish tokens, loaded up a few months' worth of Facebook ham, dumped all the sender hostname domains, added those to my global domain skip list, dumped a list of all unique domains that had diffs between target and display URLs, skip listed those, ran a selective MassCheck against the Facebook ham, then deployed. The data loading step took longer than all other steps combined. :)

At the same time, I also checked MySpace, and made all necessary changes, in case that became targeted (yes, it did).

Other Tests:

This approach allows several other simple phish tests.
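
All of these tests sit behind the same gate described above: a phishy token in a key header, plus the skip-list exemption. As a rough sketch only (this is not the actual implementation; the function names, the substring token match, the naive two-label domain extraction, and the list contents are all illustrative assumptions), the gate and the display-vs-target URL test might look like this in Python:

```python
# Illustrative data only. The token list uses examples from this post;
# the skip list is a small placeholder, not a real production list.
PHISH_TOKENS = {"ebay", "citibank", "wamu", "facebook"}
SKIP_DOMAINS = {"facebook.com", "fbcdn.net", "tfbnw.net"}


def registered_domain(host):
    """Naive stand-in for a public-suffix lookup: keep the last two labels."""
    labels = host.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def has_phishy_token(*header_domains):
    """Gate: True if any key-header domain contains a phish token.

    Plain substring matching is an assumption made for this sketch.
    """
    return any(
        token in domain.lower()
        for domain in header_domains
        for token in PHISH_TOKENS
    )


def diff_domains(host_a, host_b):
    """Fire when the two domains differ, unless BOTH are skip listed."""
    a, b = registered_domain(host_a), registered_domain(host_b)
    if a == b:
        return False
    if a in SKIP_DOMAINS and b in SKIP_DOMAINS:
        return False
    return True


def url_diff_test(sender_domain, from_domain, display_host, target_host):
    """Run the display-vs-target URL test only on token-gated mail."""
    if not has_phishy_token(sender_domain, from_domain):
        return False  # no phishy token in Sender or From: never tested
    return diff_domains(display_host, target_host)
```

With these placeholder lists, a message whose From domain contains "ebay" and whose link displays www.ebay.com but points at some unrelated host would fire, while a mailing-list message with mismatched URLs but no token in Sender or From is never tested at all. The same diff_domains() check could be reused for the Sender vs From comparison.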
Currently I've got 11 small tests, some of them sane+sensible, some of them uber-aggressive (and only appropriate if you have a good FP pipeline).

I came up with this approach after digging thru about a hundred hand selected phish, and noticing that _ALL_ had different domains in the SMTP Sender and From. I was about to code that as a test, when I realized that pattern was common with mailing lists, so I needed some way to restrict which emails were tested. Hence, the phishy tokens tactic. It was a little later that I added skip domains, mainly to replace some hard-coded ugly kludges.

I find that rule (diff domain in Sender vs From hdr) is MUCH more effective than a URL diff domain rule. Note that an IP-based exception must be made for Paypal (the From domain is always different for user transactions).

Here's a few other simple rules that work if using this approach:

- phished domain name appears in Param or Sub-host part of URL (unless the URL's target domain is on a skip list)
- Raw IP address in URL
- "unusual" Nation in Received IPs (my test takes as a parameter a separate list of permitted Nations, which is customized for each domain or group of end user accounts)

Show Me The Numbers!:

Here's some actual stats (current 6 months) for my most diverse (ham-wise) domain, showing the number of hits for each of the tests described above (6308 phish hits out of 129604 spams):

domain-DiffDomains     416
domain-InParamOrSub   2366
domain-RawIP            12
hdr-HostName          4843
hdr-DiffDomains       3023
hdr-Nation            5427

NOT all of those were phish, however all did have a phishy token in a key header.

Here's stats for Jan-2010 for my primary (pure-Geek) domain:

domain-DiffDomains     144
domain-InParamOrSub    106
hdr-HostName           237
hdr-DiffDomains        242
hdr-Nation             241

That's for 745 actual phishes (ALL semi-hand-verified, I excluded Facebook).
Here's 508 Facebook phishes for the same period and domain:

domain-DiffDomains       0
domain-InParamOrSub    508
hdr-HostName           508
hdr-DiffDomains        508
hdr-Nation             496

When I hand verified all my phish hits for that data sample, 318 were NOT actual phish. There were a total of 21955 spam. The only FPs were 2 from an oft targeted company that we had never done business with (I should have pre-emptively skip listed their key domains ages ago).

All tests are listed by domain/hdr, then in the order I described them, above.

Advanced Considerations:

About two years ago, I split my phish tokens into two lists: generic (e.g. "bank") and specific (e.g. "ebay"). That gave me more flexibility in my matching algorithms that decide whether to run the phish tests.

The algorithm for "specific" tokens is very simple. The main consideration is handling occurrence in the From's RealName (I added that much later). My "generic" tokens algorithm looks at position within each domain, and other factors. I recommend implementors start with a single simple algorithm, then play with some data and tweak for effect. :) As the Facebook stats show, you can achieve a VERY high kill rate with JUST the simple stuff. :)

In general, specific tokens are MUCH safer. If you have a good corpus and tools, it's easy to data mine skip domains, and go with some carefully selected generic tokens, however you're all but guaranteed hits on non-phish spam.

About a year ago, I spotted a phish that used a zombified home DSL machine as its target. Since I already had that provider in my domain skip list (purely for performance reasons), it did NOT trigger my domain display-target phish test. I've since added a SEPARATE skip list that is EXCLUDED from consideration only during phish testing, and moved all large ISPs onto that list.

A LOT of non-phish spam hit these rules, usually because the spammer forged a domain that has a financially oriented "generic" phish token. I'm ok with the extra kills. Really, I am.
:) The pedantic part of my brain would be happier if only phish were killed by these, but this is one of the VERY few times I ignore that urge. I hope John Hardin will forgive me. ;)

The more aggressive/generic your phish tokens, the higher your FP rate (yeah, that's obvious, and is (respectfully) aimed at the grasshoppers). I find that most mis-fires are regular senders, so I use lower phish scores with my new users, identify all their potential problems, skip list accordingly, then up their scores. I'm dealing entirely with small to medium domains, and have good tools, so that makes sense in my environment. I would expect large-scale environments to use less aggressive tokens.

Sharing Data:

One thing that would be helpful is if we built up a database of skip domains for EACH phish target. I should have been doing this from the beginning, but I just auto-added them rather than recording which domains matched which target. In the near future, I'll be doing some data mining to rectify my lapse. For example: fbcdn.net and tfbnw.net belong to Facebook.

I've also been moving the IP ranges of all financial organizations and financial ESPs into separate "virtual" nations (about once per month I merge my virtual/manual-override ranges with fresh data from the RIRs, then redistribute to my user base). That's particularly useful for non-Americans who use Nation-based testing (they aren't forced to include all of the USA, when all they really want is eBay/Paypal/etc).

I hope that's both clear and useful. I've got a rather bad case of flu, which led to me :) wanting to hand verify several hundred phish hits, but it could also have resulted in more obtuse language than usual from me.

- "Chip"