Re: [SURBL-Discuss] MIT Spam conference
Yarg. I hate it when this happens. Maybe it's free, but it's still ~$600
to get me there and back, and I can't write it off or cover it
personally just now.

Ummm... Hey! Anybody want to pay me to program some stuff or write some
rules or something? :-) I'll take good notes. :-)

- Ryan

William Stearns wrote to ML-spamassassin-talk and ml-surbl-discuss on Fri,...:

> Good day, all,
>
> I'll be attending the MIT spam conference this year, Jan 21st, 9-5.
> Details at http://www.spamconference.org/ . The registration is free,
> but they suggest an early registration before the conference fills up.
>
> I'd love a chance to meet other people working on spamassassin and
> surbl. Is anyone else planning on attending?
>
> Cheers,
> - Bill
>
> ---
> "God grant me the senility to accept the things I cannot change,
> The frustration to try to change things I cannot affect,
> and the wisdom to tell the difference."
> (Courtesy of Mike Ricketts <[EMAIL PROTECTED]>)
> --
> William Stearns ([EMAIL PROTECTED]). Mason, Buildkernel, freedups, p0f,
> rsync-backup, ssh-keyinstall, dns-check, more at: http://www.stearns.org
> --
> ___
> Discuss mailing list
> [EMAIL PROTECTED]
> http://lists.surbl.org/mailman/listinfo/discuss

--
Ryan Thompson <[EMAIL PROTECTED]>

SaskNow Technologies - http://www.sasknow.com
901-1st Avenue North - Saskatoon, SK - S7K 1Y4

Tel: 306-664-3600   Fax: 306-244-7037   Saskatoon
Toll-Free: 877-727-5669 (877-SASKNOW)   North America
Re: A simple way to...
Robin Lynn Frank wrote to users@spamassassin.apache.org:

> We use SA 3.0.0 with MySQL so we can extract certain AWL data and use
> it at the MTA level. However, since SA doesn't have an auto-blacklist
> feature,

Hi Robin,

Actually, "AutoWhiteList" (AWL) is a bit of a misnomer. AWL maintains
average message scores for sender/class-B tuples, so, in effect, it is
also an auto blacklist, because repeat spam senders will have high
average scores in the AWL database.

> I'd like to find a relatively simple way to extract IP addresses from
> emails that contain spam. If it is of any importance, we invoke SA via
> amavisd-new.

See, for instance, the check_whitelist script in the tools/ directory of
the distribution. I get output like this:

    -4.5 (-35.6/8) -- [EMAIL PROTECTED]|ip=64.59
     9.3 (27.9/3)  -- [EMAIL PROTECTED]|ip=65.39

The first line is for a user that sends ham, so his/her score on future
messages would be pushed closer to -4.5. The second line is for a user
that sends spam, so, if they sent a more hammy message later, the AWL
would likely *add* points to the message, while decreasing the average
slightly. It works both ways.

If you want to use this at the MTA level, I could envision you wanting
to grab, say, every entry over a certain average score and potentially
greylist based on that or something.

Hope this helps,
- Ryan
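The "grab every entry over a certain average score" idea could be sketched roughly as follows. This is a minimal Python sketch, not part of SpamAssassin: it assumes the output lines look like the two samples quoted above (mean, then total/count in parentheses, then address|ip=prefix), and the names `AwlEntry`, `parse_awl_line`, `entries_over` and `awl_adjusted_score` are hypothetical. The adjustment function assumes the AWL moves a message's score toward the stored mean by a configurable factor, with 0.5 taken here as an assumed default; check that against your own version before relying on it.

```python
import re
from typing import Iterable, List, NamedTuple, Optional


class AwlEntry(NamedTuple):
    mean: float      # average score, as printed in the first column
    total: float     # sum of all scores seen for this sender/IP tuple
    count: int       # number of messages seen
    address: str     # sender address
    ip_prefix: str   # first two octets of the sending IP


# Assumed layout, based on the sample lines: MEAN (TOTAL/COUNT) -- ADDR|ip=A.B
_LINE = re.compile(
    r"^\s*(-?\d+(?:\.\d+)?)\s*"
    r"\(\s*(-?\d+(?:\.\d+)?)\s*/\s*(\d+)\s*\)\s*"
    r"--\s*(.+)\|ip=(\S+)\s*$"
)


def parse_awl_line(line: str) -> Optional[AwlEntry]:
    """Parse one output line; return None if it does not match the layout."""
    m = _LINE.match(line)
    if m is None:
        return None
    mean, total, count, address, ip_prefix = m.groups()
    return AwlEntry(float(mean), float(total), int(count),
                    address.strip(), ip_prefix)


def entries_over(lines: Iterable[str], min_mean: float = 5.0,
                 min_count: int = 2) -> List[AwlEntry]:
    """Keep entries whose average is at least min_mean over min_count messages."""
    parsed = (parse_awl_line(line) for line in lines)
    return [e for e in parsed
            if e is not None and e.mean >= min_mean and e.count >= min_count]


def awl_adjusted_score(score: float, mean: float, factor: float = 0.5) -> float:
    """Move a message score toward the stored mean (factor is an assumption)."""
    return score + (mean - score) * factor


sample = [
    "-4.5 (-35.6/8) -- hamsender@example.com|ip=64.59",
    "9.3(27.9/3) -- spamsender@example.net|ip=65.39",
]
for entry in entries_over(sample):
    print(entry.address, entry.ip_prefix, entry.mean)
```

The `min_count` guard is a design choice rather than anything from the AWL itself: a single high-scoring message is weak evidence, so an MTA-level greylist would probably want a few observations before acting.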
Re: scan times up!
Chris Santerre wrote to Spamassassin-Talk (E-mail):

> Well...
>
>     ver     avg scan time
>     2.4x    2.7 seconds
>     3.0     30.4 seconds
>
> OH MY! Network test :) Any longer and I might just be doing
> greylisting by accident. ;)

:-) Others have pointed out some possible causes.

I did fairly extensive testing between 2.6x and 3.0 before upgrading,
which included performance benchmarks, and, for certain configurations,
I found 3.0 to be marginally faster than 2.6x. In all cases *with
equivalent configurations*, performance was about the same.

- Ryan
Announce: GetURI 1.6 Released
2004-09-30: GetURI 1.6 Released

I'm very pleased to announce the release of GetURI 1.6. Many new
features have been put into this quickly growing program, as have a few
important bug fixes. Everyone already using GetURI is strongly
encouraged to upgrade as soon as possible. If you haven't yet tried
GetURI, now is a great time to start!

What is GetURI?

GetURI is a program using the SpamAssassin libraries, designed to
extract URIs from ham and spam messages, mbox files, or lists of
domains, and present them in a format designed to help classify domains
for anti-spam efforts such as SURBL, although it has other uses, too.
The included 'uricat' utility provides a simple way to extract URIs
from virtually any text file, regardless of how they are encoded.

With the help of the SpamAssassin libraries, GetURI attempts to ignore
unclickable domains (i.e., poisoning attempts), follow redirects, and
otherwise simulate the action of mail user agents (MUAs) as closely as
possible.

Sample output: http://ry.ca/geturi/results.html

What's new?

Here are just a few of the most notable additions to GetURI 1.6:

 - Support for SpamAssassin 2.6x has been re-introduced. Now 3.0 and
   2.6x are officially supported.

 - By popular demand, support for processing mbox files has been added.

 - GetURI now does several forward lookup checks on domains, including
   SBL/XBL, IADB2/WADB, as well as checks on nameservers, to aid
   classification.

 - More documentation is now included in the output, and the output
   format has been improved visually, to hopefully be somewhat more
   intuitive.

 - It is now possible to specify a specific SURBL host to query,
   instead of the previous default of multi.surbl.org.

 - A potentially large memory leak was discovered in the handling of
   SA3.0 objects. Consequently, SA3.0 users should upgrade immediately
   to enjoy drastically reduced memory consumption.

Many more changes have been implemented; please see
http://ry.ca/geturi/CHANGELOG for details.

To fetch the new version of GetURI, please visit http://ry.ca/geturi/

As always, your feedback will help improve GetURI! Additional testers
are always welcome.

- Ryan Thompson <[EMAIL PROTECTED]>
Re: MIMEDefang, SpamAssassin and URIDNSBLs
Tim Boyer wrote to users@spamassassin.apache.org:

> 3. Do I have DNS lookup enabled? Yup:
>
>     # Enable or disable network checks
>     dns_available yes
>     skip_rbl_checks 0
>     rbl_timeout 15
>
> Can't think of anything else to try.

Do you have

    # If boolean true, skip SA network tests
    $SALocalTestsOnly = 1;

in your mimedefang-filter? Make sure you set $SALocalTestsOnly to zero.
For whatever reason, MIMEDefang decided they would override this *one*
SA option within mimedefang-filter. ;-)

If that doesn't help, get a bigger hammer, or maybe ask on the
MIMEDefang list.

> If I knew how to make MIMEDefang call SpamAssassin with the debug
> switch, that might point me in the right direction.

MIMEDefang uses the SA libs directly... which means, so can you, in
mimedefang-filter. :-) I've never tried it, but you should be able to
enable debugging output before calling the SA check in filter_end().

- Ryan
Re: stripping SA headers for reporting? (spamcop, etc.)
Andre Nicholson wrote to users@spamassassin.apache.org:

> John Owens wrote:
>
>> I'd like to send as original a message as I can to SpamCop and other
>> places since they don't like munged reports. Currently I'm doing this
>> manually, which is annoying. I note that sa-learn knows how to remove
>> all SA-specific annotations from a message (unwraps MIME, removes
>> headers, etc.). Is that functionality available in any other way?
>
>     spamassassin -d < MESSAGEFILE > NEWFILE
>
> Or to also report it afterward
>
>     spamassassin -d < MESSAGEFILE > NEWFILE && spamassassin -r < NEWFILE

RTFM, folks. :-)

SPAMASSASSIN(1):

    -r, --report
        Report this message as manually-verified spam. This will submit
        the mail message read from STDIN to various spam-blocker
        databases. [...] If the message contains SpamAssassin markup,
        the markup will be stripped out automatically before submission.

This does the same thing as -d before submission. If it doesn't do what
you want, then your upstream probably isn't adding SA markup (i.e.,
they're wrapping it themselves using MIMEDefang or something).

- Ryan
Re: URI obfuscation check
Jeff Chan wrote to SpamAssassin Users:

> Update on the previous, interestingly the HTML renderer in The Bat!
> 1.62q did not make the link clickable, but the plaintext message
> renderer did.

That's because the HTML did not actually contain a link (anchor); just
the plaintext URI. Many plaintext renderers will, however, link anything
that looks like a URI.

- Ryan
Re: Start an IP list to block?
[ Whew! CC trimmed :-) ]

Jeff Chan wrote to Justin Mason:

> Yeah. I was referring to the proposal to look up IP addresses for
> href hostnames directly (instead of looking up the NS'es.)
>
> Yep. Resolving domain names found in spam URIs is slow

Aha. Key word = "domain names". All the world's a host. Spammers are
already using random subdomains in their emails, and there is absolutely
*no* guarantee whatsoever that these subdomains resolve to the same
IP(s) as the registered domain (or even as the rest of the subdomains).
It's basic DNS, and, in this case, it means we're basically screwed
before we start. :-)

There *may* be some benefit to the idea, but I'm betting it would be
extremely short-term, because spammers would too easily thwart it by
pointing their registered domain's A record somewhere else. Unless we
started keeping more host information... but then we're effectively
DoSed by the sheer number of subdomains in use.

There are a few ways I could think to greatly optimize that, but, so
far, I don't see a big win.

- Ryan
Re: Start an IP list to block?
Jeff Chan wrote to Ryan Thompson:

> On Thursday, September 9, 2004, 2:34:00 PM, Ryan Thompson wrote:
>
>> "Can't" is a curse word to a scientist. "Can't *yet*", on the other
>> hand, is usually a good motivator!
>>
>> - Ryan
>
> A good scientist has at least a working understanding of the
> theoretical limits of knowledge.

Hahaha! Ye cracketh me up, Jeff. If you ever find yourself in
Saskatchewan, you can drink my beer and we can talk scientific
philosophy. :-)

Now, I'm going to get back on topic before somebody starts shooting.

- Ryan
Re: [SURBL-Discuss] Start an IP list to block?
Jeff Chan wrote to SURBL Discussion list and Spamassassin-Talk (E-mail):

> .com is so large and rapidly changing as to be practically unknowable.
> That's what I mean by "can't".

IIRC, .com is up to about 25M domains, and it's way, way higher than the
other gTLDs (and light years beyond ccTLDs).

> By the time you have all of .com fully cataloged, it will have changed
> significantly.

25M queries isn't that hard, and it can be trivially distributed to make
for a more responsive system. Even 250M isn't out of reach. As I
mentioned, the base problem has already been solved by whois.sc, and
probably others. We just need to adapt it to be useful in fighting spam.

Oh, and, we can *also* use this data to safely determine domain age for
newly registered domains. Since the most spammy domains are less than a
week old, we'll start to have useful information for *that* within about
a week. :-)

> Really the only ones who could collectively determine how spammy a
> particular virtual host IP is are the domain registrars working
> together and pooling all their registration data then resolving every
> hostname and building a database of all the resolved IPs mapped back
> into all of their domain names.

That's *exactly* what I'm suggesting, and the registrars already pool
their data. They're called TLD zone files, and (almost) anyone can
download them.

> If you can't see all the good guy domains on a virtual hosting IP,
> then you can't see who else you would block.

We *can*, Jeff. We can. That was the whole point of my message.

- Ryan
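The domain-age idea can be illustrated with a small Python sketch. It assumes the TLD zone files have already been parsed into sets of domain names, one set per snapshot date, and it uses the date a name was first seen in a snapshot as a stand-in for its registration date. All function names and the example domains are hypothetical.

```python
from datetime import date
from typing import Dict, List, Set, Tuple


def diff_snapshots(old: Set[str], new: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Return (added, removed) domain names between two zone snapshots."""
    return new - old, old - new


def record_first_seen(first_seen: Dict[str, date], snapshot: Set[str],
                      seen_on: date) -> None:
    """Remember the earliest snapshot date on which each domain appeared."""
    for domain in snapshot:
        first_seen.setdefault(domain, seen_on)


def young_domains(first_seen: Dict[str, date], today: date,
                  max_age_days: int = 7) -> List[str]:
    """Domains first seen fewer than max_age_days ago, sorted by name."""
    return sorted(domain for domain, seen in first_seen.items()
                  if (today - seen).days < max_age_days)


first_seen: Dict[str, date] = {}
record_first_seen(first_seen, {"established.example"}, date(2004, 9, 1))
record_first_seen(first_seen, {"established.example", "fresh.example"},
                  date(2004, 9, 9))
print(young_domains(first_seen, date(2004, 9, 10)))
```

Note that every domain looks new on the very first snapshot, which is why the age data only becomes useful about a week after collection starts.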
Re: Start an IP list to block?
Jeff Chan wrote to Chris Santerre:

> It is a question about the limits of knowledge. In our universe we
> can't see the potential collateral damage from listing a shared host,
> so we should not do it. From our point of view it's not knowable. Sure
> the hosting company knows whether that's the case, but we can't.

Ahh... but we *can*! See my follow-up.

> I'd encourage people with questions like this to read up or take some
> classes on epistemology or the theory of knowledge. Or just
> contemplate the possibilities harder... ;-)

Umm, or just help me with zone data. :-)

"Can't" is a curse word to a scientist. "Can't *yet*", on the other
hand, is usually a good motivator!

- Ryan
Re: [SURBL-Discuss] Start an IP list to block?
Chris Santerre wrote to SURBL Discussion list (E-mail):

> OK, this isn't the first time we've had this discussion, but Raymond
> and I felt this should be made public again. He ran thru some tests of
> 1500+ domains and found the following data. Looks like they maybe send
> from zombies, and never their hosts. IPs are similar across the board.
>
> So is there a way to use the IP info in a good way? Could SA or SURBL
> do a quick ping of the URL and match against a URL? This would allow
> us to simply list 1 IP instead of all these domains. (I'm well aware
> of virtual hosts! So only the filthiest of spammers would be put on
> this IP list. Then their IP better boot them or anyone hosted on that
> box would feel the wrath of SURBL.)

I talked to Raymond about this, too... and, basically, here are my big
thoughts:

We need to find the correlation of IP addresses to hostnames. See
http://whois.sc/ ; I can, with some help, duplicate what they're doing
in a way that will help us fight spam.

Then, for 219.254.32.111, we could see that there are, say, 200 sites
hosted at that IP, and, after some hand checking, identify that all of
them belong to spammers.

However, for all we know *so far*, 219.254.32.111 could be an HA
cluster of a few dozen machines, and, while there may be 200 pill
spammers on that cluster, there may be 20,000 other legit sites. With
our current data, we can't make either determination.

But, using forward zone data, we can do forward lookups, and track them
in a database. Then, do forward lookups on SURBL data to get the IPs of
spammers, and (algorithmically!) find correlations.

The programming effort to implement this would not be trivial, not to
mention processing power and bandwidth, to do the initial run. The
datasets (.com!) are huge. After that, we just have to periodically
sample for new, removed, and changed domains, at which point the
processing will be reduced.

Still, there's no way I have time or money to do this alone, given my
current commitments. I *wish* I could spend my whole day fighting spam.
I'd need a fair amount of real help.

It'd be good to make happen, though, considering we could then
*proactively* list domains (or IPs) with a high degree of confidence and
little or no collateral damage. (Because we can *measure* collateral
damage if we know which other domains are hosted on a particular IP).
And there would be many many other statistical benefits we could gain.

- Ryan
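The "measure collateral damage" step could look something like the following Python sketch. It assumes the forward lookups have already been done and stored as a mapping from domain name to resolved IP addresses, and that the listed (spammer) domains are available as a set. The names `index_by_ip`, `collateral_report` and `IpReport`, and all the example domains, are hypothetical; only the IP 219.254.32.111 comes from the message above.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Set


class IpReport(NamedTuple):
    ip: str
    total: int      # all known domains resolving to this IP
    listed: int     # of those, how many are already listed as spammers
    unlisted: int   # potential collateral damage if the IP were listed


def index_by_ip(resolved: Dict[str, Iterable[str]]) -> Dict[str, Set[str]]:
    """Invert a domain -> IPs mapping into IP -> set of domains."""
    by_ip: Dict[str, Set[str]] = defaultdict(set)
    for domain, ips in resolved.items():
        for ip in ips:
            by_ip[ip].add(domain)
    return dict(by_ip)


def collateral_report(by_ip: Dict[str, Set[str]],
                      listed_domains: Set[str]) -> List[IpReport]:
    """Report every IP hosting at least one listed domain, safest first."""
    reports = []
    for ip, domains in by_ip.items():
        listed = len(domains & listed_domains)
        if listed:
            reports.append(IpReport(ip, len(domains), listed,
                                    len(domains) - listed))
    # Fewest unlisted neighbours first, then most listed domains.
    reports.sort(key=lambda r: (r.unlisted, -r.listed, r.ip))
    return reports


resolved = {
    "pills-one.example": ["219.254.32.111"],
    "pills-two.example": ["219.254.32.111"],
    "pills-three.example": ["192.0.2.10"],
    "corner-shop.example": ["192.0.2.10"],
}
listed = {"pills-one.example", "pills-two.example", "pills-three.example"}
for report in collateral_report(index_by_ip(resolved), listed):
    print(report)
```

An IP with zero unlisted neighbours is a candidate for listing by address; one with many unlisted neighbours is the shared-hosting case where listing the IP would hurt legitimate sites. The report is only as complete as the zone data behind it.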
Re: [SURBL-Discuss] Ham corpora needed
Jeff Chan wrote to SURBL Discuss and SpamAssassin Users:

> In order to reduce false positives in the SURBL data, we would like to
> have access to ham corpora. Does anyone know of any public ham
> corpora, including just the URI domain names from the hams? Or is
> there anyone who would be willing to run our URI domain lists against
> their ham?
>
> Does anyone know if messages from the Enron corpus have been
> categorized for ham and spam?
>
>   http://www-2.cs.cmu.edu/~enron/
>
> Thanks in advance for any suggestions, comments, thoughts

FWIW, the mass-check I did on that 75K corpus took about 1.75h, on a
beefy machine with rbldnsd running on localhost, with 20 concurrent
jobs. (mass-check is slower than molasses for anything that blocks if
you don't let it run concurrent jobs :-)

Now, I know not everybody runs SpamAssassin, but it *does* have a really
easy log format and hit-frequencies program. It's possible to
concatenate ham and spam logs from different sources to effectively get
statistics on a larger corpus... and only the test hits are stored in
the log, so the results are effectively anonymous.

There's ham.log for ham, and spam.log for spam, and the entries look
like this, one line per message:

    Y 7 /spamdir/11710. URIBL_OB_SURBL,URIBL_WS_SURBL time=1089946124

Rather than re-invent the wheel, you can have your checkers output
simplified mass-check logs. The only column that matters is the tests
column. Something like this should work well enough for hit-frequencies:

    N 0 URIBL_TESTS_HIT,COMMA_DELIMITED time=

Then, grab hit-frequencies from the SA distribution and you can
reproduce the output that others have been posting.

If you *do* have SA installed (even if you don't filter your mail with
it), it's even easier. Just set up a simple .cf file with the URIBL
rules (I'll provide one on request), and invoke mass-check in the tools
directory like so:

    ./mass-check -p=../rules -c=../rules --net -j=20 --progress \
        spam:dir:${SPAMDIR} ham:dir:${HAMDIR}

Then run:

    ./hit-frequencies -s 3 -p

It's almost worth extracting Mail-SpamAssassin from CPAN just to gain
that functionality. You don't even have to *use* SA. :-)

- Ryan
Re: [SURBL-Discuss] Setting SpamAssassin scores for SURBL lists
Jeff Chan wrote to SURBL Discuss and SpamAssassin Users:

> Basically the higher the FP rate, the less useful a list is.

... or, rather, the lower it ought to be scored.

> Does anyone have other corpus stats to share, in particular FP rates?

Sure. All of these messages were received in the past 10 days. A lot has
happened since June. :-)

WS: 44004/54185s, 61/19150s

    OVERALL%    SPAM%     HAM%     S/O    RANK   SCORE  NAME
       73335    54185    19150    0.739   0.00    0.00  (all messages)
     100.000  73.8870  26.1130    0.739   0.00    0.00  (all messages as %)
      60.087  81.2107   0.0836    0.999   0.00    0.00  WS_SURBL

HOWEVER... I decided to go through the ham hits (61 of them), and look
for false positive domains to submit. I found several, but, for the most
part, they've *already* been cleaned up and are no longer listed in WS.
(30 out of the 61 were in a massive mailing list thread for a single
domain that has since been whitelisted).

And, in that 19K ham corpus, I found the following FPs still listed in
WS:

  buckeye-express.com -- Used in a personal email address, looks legit;
      7 examples
  nm.ru -- Used in a personal email address, looks legit
  advanstar.com -- Legit uses; found in a well-known dental newsletter;
      also personal email address of one of the editors; 3 messages
  00fun.com -- Confirmed, more than one user on our system sent or
      received eCards from them
  northstarconferences.com -- Legit conference host site subscribed to
      by two users; 9 messages in this corpus
  mardox.com -- Search engine; registered 1875 days ago, and *looks*
      like the user did actually submit their site to them.
  postsnet.com -- Registered exactly one year ago, 51 NANAS, blank home
      page, ehh... but I have 4 different legit newsletters with links
      to them.
  webspawner.com -- Created in 1996; free host/email
  npdor.com -- Surveys; been around since 1999. 103 NANAS, but they've
      been advertised by some reputable "word of the day" mailers
      (dictionary.com). Maybe a good candidate for UC. :-) 2 examples
  imninc.com -- Domain is 507 days old; they do newsletters. At least
      one of them is legit. :-)
  worldhealth.net -- It's 3468 days old today (1995). One of our users
      attended a conference of theirs, and signed up for a newsletter.
  hoteldiscounts.com -- 2459 days old (1997), found in actual room
      booking confirmations for Comfort Inn.

(I'll re-post these in another thread, just so everybody sees them).

AND, I found 2 spams that were incorrectly hand-classified as ham. So,
if I take those out, the numbers look more like:

WS: 44006/54187s, 0/19148s

    OVERALL%    SPAM%     HAM%     S/O    RANK   SCORE  NAME
       73335    54187    19148    0.739   0.00    0.00  (all messages)
     100.007  73.8897  26.1103    0.739   0.00    0.00
      60.087  81.2111   0.0000    1.000   0.00    0.00  WS_SURBL

Is that more like what you had in mind..? No, I'm not making that
up. :-)

Anyone with ham corpora, just search for WS_SURBL hits and give 'em a
hand-check.

- Ryan
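As a check on how the WS_SURBL row in the first table hangs together, the SPAM% and S/O columns can be reproduced from the figures given. This assumes S/O is computed as SPAM% / (SPAM% + HAM%); that definition is not stated in the message, but it is consistent with the rounded values shown.

```latex
\begin{align*}
\text{SPAM\%} &= 100 \times \frac{44004}{54185} \approx 81.2107 \\[4pt]
\text{S/O} &= \frac{\text{SPAM\%}}{\text{SPAM\%} + \text{HAM\%}}
            = \frac{81.2107}{81.2107 + 0.0836} \approx 0.999
\end{align*}
```

With no ham hits at all, HAM% is zero and the same ratio becomes exactly 1.000, which is what the second table shows.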
Re: shifting the midpoint between the average spam and average ham scores back to 5.0
Joe Flowers wrote to users@spamassassin.apache.org:

> Help please!
>
> If the average spam score of all of my ham messages is 1.0 and the
> average spam score of all of my spam messages is 3.0, then what is the
> best way to move the average_of_these_two_averages (2.0) back up to
> 5.0?
>
> The result being that I need my current average score for ham messages
> to be "4" and my current average score for spam messages to be "6".
> And, I need to do this without screwing up the relative statistics of
> spamassassin.

Hmm... After reading this thread, I think you *do* have a good question,
here, and that you did already get some good answers, but I'd like to
add a bit.

You make a valid point in that, if graphed separately, ham and spam
should show up as two separate curves on a graph. However, there *is*
overlap, and spam and ham (separately, or together) scores are *not*
normally distributed. They don't have to be to calculate the mean of the
means, but, in doing so, you're going to have a great deal of false
positives.

What you really should do is decide how many false positives you (and
your users) can live with. For us, it's 1/2000 (0.05%, one twentieth of
a percent). For this, you don't even need a spam corpus. Just collect a
good ham corpus (to get 0.05%, you need at least 2000 ham) and look at
the SA scores. Choose your threshold (or your constant modifier) to hit
on fewer than 1 in 2000 messages, and re-check regularly.

You can cross-check this with a spam corpus, if you want to balance FPs
against FNs (if you're well below your maximum FP ratio, you have some
room to play).

We get a lot less than 1/2000 FPs (usually 0), but 1/2000 is the maximum
ratio we'd allow before increasing the threshold.

- Ryan
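The threshold selection described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the ham corpus has already been scored and the scores collected into a list, and assuming a message counts as a false positive when its score is greater than or equal to the threshold. The function names and the 0.1 step are illustrative choices, and the sketch treats 1/2000 as the maximum allowed ratio (so exactly 1 in 2000 passes).

```python
from typing import Sequence


def fp_rate(ham_scores: Sequence[float], threshold: float) -> float:
    """Fraction of ham messages that would be flagged at this threshold."""
    flagged = sum(1 for score in ham_scores if score >= threshold)
    return flagged / len(ham_scores)


def lowest_threshold(ham_scores: Sequence[float],
                     max_fp_rate: float = 1 / 2000,
                     start: float = 0.0,
                     step: float = 0.1) -> float:
    """Lowest threshold (in `step` increments) keeping FPs within the limit."""
    if not ham_scores:
        raise ValueError("need a scored ham corpus")
    k = 0
    while True:
        threshold = round(start + k * step, 10)
        if fp_rate(ham_scores, threshold) <= max_fp_rate:
            return threshold
        k += 1


# 2000 ham messages: most score 1.0, two outliers score 6.2 and 7.4.
ham_scores = [1.0] * 1998 + [6.2, 7.4]
print(lowest_threshold(ham_scores))
```

With a corpus of exactly 2000 messages, the limit allows at most one flagged ham message, which is why a smaller corpus cannot demonstrate a 0.05% rate. Re-running this regularly on fresh ham matches the "re-check regularly" advice.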