http://bugzilla.spamassassin.org/show_bug.cgi?id=4505
------- Additional Comments From [EMAIL PROTECTED] 2005-07-31 18:42 -------

SM> It's tricky getting a good corpus: ...

In addition to your reasons, a good corpus for local use (it's spam here, and always spam here) may not be good for global use (it's not spam to users on that other system over there).

And to expand on your point:

SM> There are people who [sa-learn as spam] not because they are clueless,
SM> but if they don't recognize that something comes from a subscription
SM> or just aren't sure, ...

There are also sources that confound matters -- a user can sign up with them for one brand, and then receive emails from a corporate parent with a different domain name.

SM> And there's Constant Contact who may have found a way around what at
SM> first glance appears to be a good defense against spam.
SM> ... if Constant Contact really is doing that, they must be counting on
SM> low numbers of complaints.

Apparently they are, based on the large number of cc.com emails here that qualify for the BSP rules.

SM> That link I posted to Ironport's site listed the Bonded Sender fees as
SM> of two years ago.

It makes it risky for a single customer to spam. But I can see how Constant Contact could have a business model based on getting paid by a mix of spammers and hammers. The Bonded Sender fines are based on the number of complaints per million mails. If you want to nail them, get aggressive about reporting the confirmed RCVD_IN_BSP_TRUSTED spam.

...

My family gets a lot more ham than spam from cc.com, so in the past, on those rare occasions when we've gotten cc.com spam, I've gone directly to them, with satisfactory results. Given what I'm seeing now in this corpus, I'll send in the formal complaints to BSP/Ironport, to increase cc.com's incentive to police their customers.

SM> So how do you have a clean corpus when it could contain edge cases
SM> that are classified wrong? ...

Or, IMO more correctly: a valid and representative corpus used for scoring /should/ contain edge cases that may or may not be classified wrong -- there's no other way for a major ISP, which can't know what its users did or didn't subscribe to, to manage its spam. It's important to classify them as accurately as humanly possible, but for SA to be optimally useful it needs to be able to make judgments about the edge cases as well, and it can only do that if we take the risk and include them in our corpus.

SM> What is the "correct" score for such mail? If the only difference
SM> between a piece of spam and a piece of ham is whether the recipient
SM> subscribed to it, how do you call either one an FP or an FN for the
SM> purpose of the rule scoring program?

I don't have an answer to that. First-pass suggestion: aim to get these "edge" emails into the 2.0-4.0 score range, so that network tests and hopefully Bayes can push them over 5.0 or under 0.0, as appropriate for the user/site.
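
To make that last suggestion concrete, here is a toy sketch of the idea (Python rather than SA's Perl, and every delta in it is an illustrative assumption, not a proposed score -- only the 5.0 spam threshold matches SA's default required_score). The point is that a static score parked in the 2.0-4.0 band leaves both thresholds within reach of the per-user/site contribution:

    # Toy model of SA-style additive scoring. Thresholds and deltas are
    # assumptions for illustration, not actual SpamAssassin rule scores
    # (except SPAM_THRESHOLD, which matches the default required_score).
    SPAM_THRESHOLD = 5.0   # at or above this, mark as spam
    HAM_FLOOR = 0.0        # at or below this, call it confidently ham

    def classify(static_score, bayes_delta, network_delta):
        """Combine static rule hits with per-user/site adjustments."""
        total = static_score + bayes_delta + network_delta
        if total >= SPAM_THRESHOLD:
            return total, "spam"
        if total <= HAM_FLOOR:
            return total, "ham"
        return total, "unsure"

    # The same "edge" mailing, parked at 3.0 by the static rules:
    print(classify(3.0, +2.5, +0.5))  # (6.0, 'spam')  -- recipient never subscribed
    print(classify(3.0, -3.5, 0.0))   # (-0.5, 'ham')  -- recipient did subscribe

Score the edge cases too high or too low in the static rules and one of those two outcomes becomes unreachable, which is the whole problem with forcing a "correct" global score on subscription-dependent mail.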
