Re: Need Volunteers for Ham Trap
On 02/07/2011 05:37 PM, Mahmoud Khonji wrote: On 01/21/2011 01:06 AM, Warren Togami Jr. wrote: On 1/20/2011 7:23 AM, R - elists wrote: initially this came across as a really suspect idea... i.e., one man's junk is another man's treasure Ham is a lot easier to define than Spam. Ham is simply anything that you subscribed for. I am currently subscribed to a number of mailing lists to collect ham emails (in addition to other sources). While it might be true that mailing lists can be good sources of ham, their emails do not contain a realistic diversity of features/characteristics. I explicitly excluded discussion mailing lists from the ham trap. In my view, the issue is not just ensuring an email is ham, but also ensuring that it contains a realistic set of features. If the features are not realistic, and if we optimize test scores based on that, then we might end up worsening test scores for realistic end-users. Not if it is subscribed to hundreds of opt-in subscriptions for legitimate mail that ordinary users receive, most of which is otherwise not represented in the corpora. Many of these subscriptions send mail only once a week or month. It is true that the hamtrap corpus is synthetic and thus not fully representative in frequencies of real ham. But its volume is only a tiny fraction of a percent of our total ham. It helps us to detect and fix problems in individual rules by injecting some variety without causing a measurable impact on the entire corpus. For example, most list emails are non-HTML, while most end-user ham and spam emails are HTML. Evaluating sets of features (or tests) based on this unrealistic corpus is likely to fool us into thinking that a feature/test is more effective than it is in reality (i.e. we might end up giving MIME-based tests higher scores). The spec and implementation of this ham trap already took this and many other issues into consideration. We've already had a few experts here conclude the plan is sound. 
I'm somewhat annoyed by the armchair quarterback negative comments on this topic. (Not just you) didn't read the rest of this thread to realize this particular concern is moot. None of the people complaining about how this is such a bad idea are being helpful by actually participating in the nightly masscheck. Talk is cheap. I'm actually doing something. Warren
Re: Need Volunteers for Ham Trap
On 2/8/11 3:15 AM, Warren Togami Jr. wtog...@gmail.com wrote: I'm somewhat annoyed by the armchair quarterback negative comments on this topic. (Not just you) didn't read the rest of this thread to realize this particular concern is moot. Ditto. I don't really have time to participate in this activity, but the methodology is sound and provides a needed source of ham. Many people want these opt-in lists, and I don't want to block them. None of the people complaining about how this is such a bad idea are being helpful by actually participating in the nightly masscheck. I do participate in masschecks, primarily because I have a lot of mail from politicians (campaign pieces, updates from my congressman, notes from party officials, and the like) that was getting flagged as spam even though it is clearly opt-in, and unsubscribing is clear and simple. The main corpus used in masschecks is the mail for a bunch of techies, and I had a divergent set of mail from this other interest in my life. Warren's project extends that concept much further than just the side-interests of a couple of us nerds/wonks. Talk is cheap. I'm actually doing something. Keep it up! Warren -- Daniel J McDonald, CCIE # 2495, CISSP # 78281
Re: Need Volunteers for Ham Trap
On 01/21/2011 01:06 AM, Warren Togami Jr. wrote: On 1/20/2011 7:23 AM, R - elists wrote: initially this came across as a really suspect idea... i.e., one man's junk is another man's treasure Ham is a lot easier to define than Spam. Ham is simply anything that you subscribed for. I am currently subscribed to a number of mailing lists to collect ham emails (in addition to other sources). While it might be true that mailing lists can be good sources of ham, their emails do not contain a realistic diversity of features/characteristics. In my view, the issue is not just ensuring an email is ham, but also ensuring that it contains a realistic set of features. If the features are not realistic, and if we optimize test scores based on that, then we might end up worsening test scores for realistic end-users. For example, most list emails are non-HTML, while most end-user ham and spam emails are HTML. Evaluating sets of features (or tests) based on this unrealistic corpus is likely to fool us into thinking that a feature/test is more effective than it is in reality (i.e. we might end up giving MIME-based tests higher scores). Mahmoud
Re: What is Ham? (was Re: Need Volunteers for Ham Trap)
On Thu, 2011-01-20 at 21:50 -0800, Jeff Chan wrote: Yes and no. If you sign up for Joe's Bagel Company mailing list to find out about the latest Bagel news, and some new marketing guy joins the Bagel company and starts sending marketing messages about Bananas to that list, then the original purpose of the list and what you thought you signed up for has been corrupted. Most people would consider the latter to be spam, and rightly so. That's exactly what I was saying about BT. I'm signed up for a list whose stated purpose is to send billing and service change notifications about their phone service, but BT's marketing department suddenly started using it to flog broadband. I think most people would consider these sales messages to be spam and the BT sales department to be spammers. Under these circumstances I would *not* expect the list to get a Pure-Ham rating. Martin
RE: Need Volunteers for Ham Trap
This is a misunderstanding. I am largely against whitelisting or negative score rules. I merely intend to increase the variety of legitimate mail in the nightly ham corpus so our spam-hostile rules can be better tested for safety. This will be interesting especially with non-English ham. Warren Warren, so, are you going to keep two or more corpus datasets? one as it is, and one with the new mail for comparison? initially this came across as a really suspect idea... i.e., one man's junk is another man's treasure for a moment, it appeared we were gonna need to review the good and the bad of spam-l to avoid serious SA list issues. statistically speaking, this shouldn't sway the scoring substantially anyway, would it? what should be known so that bad data is not allowed into the HAM corpus ? - rh
Re: Need Volunteers for Ham Trap
On Tue, Jan 18, 2011 at 12:59, Warren Togami Jr. wtog...@gmail.com wrote: On 1/17/2011 11:46 PM, Jeff Chan wrote: So a couple points: 1. Subscribing to lists opens up lots of grey areas including the above. 2. Some of the areas are very difficult to resolve into spam or ham. Some more aggressive anti-spammers may say all of the above is spam, but others may disagree, and the mail may be legal. Before anyone accuses me of being in favor of spammers, please be aware that I am personally highly against any of these unethical practices, but when essentially making decisions for others, one needs to be very careful and consider whether there may be legitimate, ethical, legal or even wanted uses of such things. One person's ham may be another person's spam, and vice versa. However, most people don't want the stuff bots send. The issue is complex, and there are many deliverability, security and anti-spam companies and organizations that struggle with these issues every day. Maintaining accurate ham and spam corpora and making policies for what belongs in which category is trivial in some easy cases like bot pill spam, but non-trivial in other cases. Cheers, Jeff C. I appreciate the nuanced feedback but I have thought of similar considerations. I believe the following will help to avoid ambiguity and legal issues. * Yes, we cannot be 100% sure our opt-in was only for that particular site and not their partners. But in any case automatic ham trapped mail will be only the mail branded by the subscribed provider, because that is the only mail we know for sure was opted-in. Anything else is kept separate for later analysis. * If clearly spammy other mail arrives at a particular address, the original subscription can be unsubscribed and the continued flow monitored. That address could then be discarded. +1 to those. Tagged addressing makes this easy to implement (and track). I use this approach on a very small scale for a small number of ham newsletters in my own corpus... --j.
Re: Need Volunteers for Ham Trap
On 1/20/2011 7:23 AM, R - elists wrote: initially this came across as a really suspect idea... i.e., one man's junk is another man's treasure Ham is a lot easier to define than Spam. Ham is simply anything that you subscribed for. for a moment, it appeared we were gonna need to review the good and the bad of spam-l to avoid serious SA list issues. statistically speaking, this shouldn't sway the scoring substantially anyway, would it? You are correct. This is more of a tool to have *some* variety in the ham corpus, to make it possible to flag rules in need of scrutiny. For example, prior to 3.3.x many of our rules were utterly broken with Japanese mail. We had no idea of this fact until I added a few thousand Japanese messages to the ham corpus. JM understood the problem and fixed those rules. what should be known so that bad data is not allowed into the HAM corpus ? The previous discussion described a sort of tagged sender ham trap. This simple process automatically excludes extraneous mail in cases where the address was shared with affiliates or spammer lists. We also will be careful in sticking to reputable companies and orgs for the ham trap. Warren
What is Ham? (was Re: Need Volunteers for Ham Trap)
On Thu, 20 Jan 2011 11:06:31 -1000 Warren Togami Jr. wtog...@gmail.com wrote: Ham is a lot easier to define than Spam. Ham is simply anything that you subscribed for. Not necessarily. You could subscribe to a list expecting it to contain useful content. A few months later, the organization running the list might decide to change what it posts and start posting undesired marketing information on the list. Is that still ham? Regards, David.
Re: What is Ham? (was Re: Need Volunteers for Ham Trap)
On 1/20/2011 4:10 PM, David F. Skoll wrote: On Thu, 20 Jan 2011 11:06:31 -1000 Warren Togami Jr. wtog...@gmail.com wrote: Ham is a lot easier to define than Spam. Ham is simply anything that you subscribed for. Not necessarily. You could subscribe to a list expecting it to contain useful content. A few months later, the organization running the list might decide to change what it posts and start posting undesired marketing information on the list. Is that still ham? Of course it is. You subscribed to it. If you don't want it anymore, unsubscribe. If you unsubscribe and they keep sending it anyway, THEN it becomes spam. -- Bowie
Re: What is Ham? (was Re: Need Volunteers for Ham Trap)
On Thu, 20 Jan 2011 16:12:58 -0500 Bowie Bailey bowie_bai...@buc.com wrote: Of course it is. You subscribed to it. If you don't want it anymore, unsubscribe. I disagree. When you subscribe to a list, there's an implicit understanding of the content you are signing up for. If the list owner violates the rules and posts marketing material, that's spam. Concrete example: If I posted an ad for our commercial anti-spam system on the MIMEDefang list, that would be spam. If I posted it on this list, it would be spam-squared and I'd probably be banned. :) Regards, David.
Re: What is Ham? (was Re: Need Volunteers for Ham Trap)
On 1/20/2011 4:17 PM, David F. Skoll wrote: On Thu, 20 Jan 2011 16:12:58 -0500 Bowie Bailey bowie_bai...@buc.com wrote: Of course it is. You subscribed to it. If you don't want it anymore, unsubscribe. I disagree. When you subscribe to a list, there's an implicit understanding of the content you are signing up for. If the list owner violates the rules and posts marketing material, that's spam. Concrete example: If I posted an ad for our commercial anti-spam system on the MIMEDefang list, that would be spam. If I posted it on this list, it would be spam-squared and I'd probably be banned. :) Public discussion lists are a bit different. In that case, it is the individual post that is being considered spam rather than considering the list spammy. Since there is no overall control over the content of the posts, public lists are vulnerable to being filled with spam if the list owners are not paying attention. When you sign up for a company's email list, you get whatever they decide to send you. If they decide to start sending marketing to the list, I would not consider that spam because they own the list and they can decide what to use it for. The recipients signed up to get that company's emails and if they no longer want to receive them, they can unsubscribe. And as I said before, if the unsubscribe function doesn't work, then the emails become spam (regardless of the actual content). -- Bowie
Re: What is Ham? (was Re: Need Volunteers for Ham Trap)
On 01/20/2011 11:31 AM, Bowie Bailey wrote: Public discussion lists are a bit different. In that case, it is the individual post that is being considered spam rather than considering the list spammy. Since there is no overall control over the content of the posts, public lists are vulnerable to being filled with spam if the list owners are not paying attention. For this reason, the ham trap will not be subscribed to any discussion lists. When you sign up for a company's email list, you get whatever they decide to send you. If they decide to start sending marketing to the list, I would not consider that spam because they own the list and they can decide what to use it for. The recipients signed up to get that company's emails and if they no longer want to receive them, they can unsubscribe. And as I said before, if the unsubscribe function doesn't work, then the emails become spam (regardless of the actual content). Your understanding is exactly correct. Warren
Re: What is Ham? (was Re: Need Volunteers for Ham Trap)
On Thu, 20 Jan 2011 16:31:50 -0500 Bowie Bailey bowie_bai...@buc.com wrote: When you sign up for a company's email list, you get whatever they decide to send you. OK. I guess we'll agree to disagree on our definitions, then. Regards, David.
Re: What is Ham? (was Re: Need Volunteers for Ham Trap)
On Thursday, January 20, 2011, 1:31:50 PM, Bowie Bailey wrote: On 1/20/2011 4:17 PM, David F. Skoll wrote: When you sign up for a company's email list, you get whatever they decide to send you. If they decide to start sending marketing to the list, I would not consider that spam because they own the list and they can decide what to use it for. The recipients signed up to get that company's emails and if they no longer want to receive them, they can unsubscribe. And as I said before, if the unsubscribe function doesn't work, then the emails become spam (regardless of the actual content). Yes and no. If you sign up for Joe's Bagel Company mailing list to find out about the latest Bagel news, and some new marketing guy joins the Bagel company and starts sending marketing messages about Bananas to that list, then the original purpose of the list and what you thought you signed up for has been corrupted. Most people would consider the latter to be spam, and rightly so. OTOH if the Bagel company decides to send non-Bagel messages to a Bagel specific list, then one knows exactly: 1. Who to blame 2. Where to unsubscribe 3. What went wrong etc. So at least there is a responsible party to hopefully act on unsubscriptions, fire the spammy marketer, etc. It's sort of a degenerate case of the degenerate case of email addresses going to a third party, except it's the same party. Spam is easy. Ham is hard. Cheers, Jeff C. -- Jeff Chan mailto:je...@surbl.org http://www.surbl.org/
Re: Need Volunteers for Ham Trap
On Tuesday, January 18, 2011, 4:59:05 AM, Warren Jr. wrote: * Yes, we cannot be 100% sure our opt-in was only for that particular site and not their partners. But in any case automatic ham trapped mail will be only the mail branded by the subscribed provider, because that is the only mail we know for sure was opted-in. Anything else is kept separate for later analysis. * If clearly spammy other mail arrives at a particular address, the original subscription can be unsubscribed and the continued flow monitored. That address could then be discarded. Both seem reasonable approaches. Those degenerate cases of both are indeed interesting. Cheers, Jeff C. -- Jeff Chan mailto:je...@surbl.org http://www.surbl.org/
Re: Need Volunteers for Ham Trap
On 01/18/2011 11:49 PM, Jeff Chan wrote: On Tuesday, January 18, 2011, 4:59:05 AM, Warren Jr. wrote: * Yes, we cannot be 100% sure our opt-in was only for that particular site and not their partners. But in any case automatic ham trapped mail will be only the mail branded by the subscribed provider, because that is the only mail we know for sure was opted-in. Anything else is kept separate for later analysis. * If clearly spammy other mail arrives at a particular address, the original subscription can be unsubscribed and the continued flow monitored. That address could then be discarded. Both seem reasonable approaches. Those degenerate cases of both are indeed interesting. Cheers, Jeff C. Yes, I think this is a reasonably simple and effective plan. I only need volunteers to help me find appropriate sites and to help subscribe. It is very boring to do all this myself. Warren
Re: Need Volunteers for Ham Trap
On Monday, January 17, 2011, 10:52:58 PM, Warren Jr. wrote: Hi folks, Here is an opportunity for non-developers to do simple tasks to help improve SpamAssassin. I am seeking volunteers to help me build and administrate a ham trap. The idea is to subscribe a list of unique e-mail addresses to various retailers, airlines, government and other legitimate bulk mail senders. A sufficient variety of ham trap subscriptions should increase the variety of legitimate senders represented in nightly masscheck and thus improve the safety of SpamAssassin's rules. Benefits of the Ham Trap * Creation of an automated, synthetic source to build a corpus of very recent ham for the nightly masscheck. Ham trap data will be expired from the masscheck after 3 months. This will be fairly easy to maintain in a 99% automated fashion, ensuring a constant stream of fresh data for the nightly masscheck largely without the need for human sorting. * Help to identify legitimate bulk senders who are performing poorly with SpamAssassin. Our data may help legitimate senders to modify their mail practices to avoid spamminess. * Each subscription is a unique tracked address. This will make it possible to definitively identify bulk senders who violate their customers' privacy by selling their e-mail address list to others. There isn't much we can do about these cases other than shame them on a web page, but for spam fighters this is useful information. While I certainly would encourage improving ham and spam corpora, this proposal may open up a lot of grey areas that may be non-trivial to resolve. Some of the legitimate mailing lists that sell, share or rent their addresses to third party senders may be doing so legally if it's permitted in the terms of use one agrees to when signing up. Obviously such a practice is questionable at best in terms of ethics, but it may be technically legal. 
There are also affiliate marketing programs explicitly based on sharing opt-in lists which may be even less ethical and apparently have many abusers. Such things may be legal while being unethical. So a couple points: 1. Subscribing to lists opens up lots of grey areas including the above. 2. Some of the areas are very difficult to resolve into spam or ham. Some more aggressive anti-spammers may say all of the above is spam, but others may disagree, and the mail may be legal. Before anyone accuses me of being in favor of spammers, please be aware that I am personally highly against any of these unethical practices, but when essentially making decisions for others, one needs to be very careful and consider whether there may be legitimate, ethical, legal or even wanted uses of such things. One person's ham may be another person's spam, and vice versa. However, most people don't want the stuff bots send. The issue is complex, and there are many deliverability, security and anti-spam companies and organizations that struggle with these issues every day. Maintaining accurate ham and spam corpora and making policies for what belongs in which category is trivial in some easy cases like bot pill spam, but non-trivial in other cases. Cheers, Jeff C. -- Jeff Chan mailto:je...@surbl.org http://www.surbl.org/
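Editorial illustration: the proposal quoted above says ham trap data "will be expired from the masscheck after 3 months". A minimal sketch of such an expiry pass follows. It is not the ham trap's actual code; the 90-day window, the `(message_id, received)` tuple layout, and the function names are all assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "3 months" is approximated as 90 days.
MAX_AGE = timedelta(days=90)

def is_expired(received, now):
    """True if a message is older than the retention window."""
    return now - received > MAX_AGE

def prune(messages, now):
    """Keep only (message_id, received) pairs still inside the window."""
    return [m for m in messages if not is_expired(m[1], now)]

now = datetime(2011, 4, 20, tzinfo=timezone.utc)
corpus = [
    ("old-newsletter", datetime(2011, 1, 1, tzinfo=timezone.utc)),
    ("recent-newsletter", datetime(2011, 4, 1, tzinfo=timezone.utc)),
]
fresh = prune(corpus, now)
```

Here `fresh` keeps only the April message, since the January one is more than 90 days older than `now`.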
Re: Need Volunteers for Ham Trap
On Tue, 2011-01-18 at 01:46 -0800, Jeff Chan wrote: While I certainly would encourage improving ham and spam corpora, this proposal may open up a lot of grey areas that may be non-trivial to resolve. Agreed, and some companies will get you to sign up for accounting and service problem notifications and then pump advertising down the channel in such volume that the purpose for which you signed up seems utterly forgotten. British Telecom sets a bad example here: they even behave like a spammer inasmuch as they regularly vary their promotions text to dodge spam filters. I'd be worried that if word gets around that SA is developing rules that give signed-up bulk mail a free ride then a lot more companies will do the same. Martin
Re: Need Volunteers for Ham Trap
On 18/01/2011 10:46, Jeff Chan wrote: 2. Some of the areas are very difficult to resolve into spam or ham. Some more aggressive anti-spammers may say all of the above is spam, but others may disagree, and the mail may be legal. I'd suggest that SA ought to be classifying e-mail in *three* broad categories, not two. Firstly, definite spam, unsolicited in any way. Secondly, definite ham (i.e. primarily genuine person-to-person e-mail and actively solicited messages such as confirmations of website transactions), which even the most aggressive spam-fighters would agree is ham. FPs in this category are bad news. And thirdly, an in-between category, of which opt-in advertising is a prime example, which at least some users are happy to receive, but where FPs aren't a major problem. With a few relatively rare exceptions, SA already classifies these categories pretty effectively, especially with a well-trained Bayesian db. Genuine ham tends to come in with negative scores, occasionally straying up to about 1 or 2. Likewise, undisputed spam rarely scores less than 8 or 10. And opt-in advertising typically comes in with neutral scores of 0 to 4. So far, so good. Using this opt-in advertising, which IMO ought to be getting neutral scores, as a ham corpus, is inevitably going to be problematic. Using it as a third, neutral corpus that is given far less weight than genuine ham would be a different matter, but would require a major change in the scoring algorithms. John. -- -- Over 4000 webcams from ski resorts around the world - www.snoweye.com -- Translate your technical documents and web pages- www.tradoc.fr
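Editorial illustration: the three score bands described in this message overlap (ham "up to about 1 or 2", opt-in advertising "0 to 4", spam "rarely less than 8 or 10"), so a score does not always pin down one category. A minimal sketch that lists the categories consistent with a score follows. The cut-offs 2, 0 to 4, and 8 are read loosely from the message and are assumptions, not SpamAssassin thresholds; the function name is hypothetical.

```python
def plausible_categories(score):
    """Return the set of categories whose typical score band contains `score`.

    Bands are illustrative assumptions taken from the message above:
    ham <= 2, opt-in advertising 0..4, spam >= 8.
    """
    categories = set()
    if score <= 2:
        categories.add("ham")
    if 0 <= score <= 4:
        categories.add("opt-in")
    if score >= 8:
        categories.add("spam")
    return categories

bands = {s: plausible_categories(s) for s in (-1.5, 1.0, 3.0, 6.0, 12.0)}
```

A score of 1.0 fits both the ham and opt-in bands, and a score of 6.0 fits none of them, which shows why a two-way ham/spam split leaves a grey zone.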
Re: Need Volunteers for Ham Trap
On 1/17/2011 11:46 PM, Jeff Chan wrote: So a couple points: 1. Subscribing to lists opens up lots of grey areas including the above. 2. Some of the areas are very difficult to resolve into spam or ham. Some more aggressive anti-spammers may say all of the above is spam, but others may disagree, and the mail may be legal. Before anyone accuses me of being in favor of spammers, please be aware that I am personally highly against any of these unethical practices, but when essentially making decisions for others, one needs to be very careful and consider whether there may be legitimate, ethical, legal or even wanted uses of such things. One person's ham may be another person's spam, and vice versa. However, most people don't want the stuff bots send. The issue is complex, and there are many deliverability, security and anti-spam companies and organizations that struggle with these issues every day. Maintaining accurate ham and spam corpora and making policies for what belongs in which category is trivial in some easy cases like bot pill spam, but non-trivial in other cases. Cheers, Jeff C. I appreciate the nuanced feedback but I have thought of similar considerations. I believe the following will help to avoid ambiguity and legal issues. * Yes, we cannot be 100% sure our opt-in was only for that particular site and not their partners. But in any case automatic ham trapped mail will be only the mail branded by the subscribed provider, because that is the only mail we know for sure was opted-in. Anything else is kept separate for later analysis. * If clearly spammy other mail arrives at a particular address, the original subscription can be unsubscribed and the continued flow monitored. That address could then be discarded. Warren
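Editorial illustration: the two bullet points above describe a tagged-address trap in which only mail from the subscribed provider is accepted automatically and everything else is set aside. A minimal sketch of that routing follows. It is not the actual ham trap implementation. The thread does not say how "branded by the subscribed provider" is determined; matching the From: domain is an assumption, and the addresses, domains, and function names are hypothetical. The From: header can be forged, so a real deployment would need a stronger check.

```python
from email import message_from_string
from email.utils import parseaddr

# Hypothetical mapping: one unique trap address per subscription,
# paired with the sender domains that provider is expected to use.
SUBSCRIPTIONS = {
    "trap+airline1@hamtrap.example": {"airline1.example"},
    "trap+retail7@hamtrap.example": {"retail7.example"},
}

def sender_domain(msg):
    """Lower-cased domain of the From: address ('' if missing)."""
    _, addr = parseaddr(msg.get("From", ""))
    return addr.rpartition("@")[2].lower()

def classify(raw, rcpt):
    """Route one message delivered to trap address `rcpt`."""
    expected = SUBSCRIPTIONS.get(rcpt.lower())
    if expected is None:
        return "unknown-address"
    domain = sender_domain(message_from_string(raw))
    # Accept the provider's own domain or any subdomain of it.
    if any(domain == d or domain.endswith("." + d) for d in expected):
        return "ham"
    # Anything else is kept separate for later analysis.
    return "hold-for-analysis"

own = "From: Deals <news@mail.retail7.example>\nSubject: Sale\n\nHello"
other = "From: promo@partner.example\nSubject: Offer\n\nHello"
results = (
    classify(own, "trap+retail7@hamtrap.example"),
    classify(other, "trap+retail7@hamtrap.example"),
)
```

Because each subscription has its own address, mail in the "hold-for-analysis" pile also shows which provider's address leaked to a third party.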
Re: Need Volunteers for Ham Trap
On 1/18/2011 1:15 AM, Martin Gregorie wrote: On Tue, 2011-01-18 at 01:46 -0800, Jeff Chan wrote: While I certainly would encourage improving ham and spam corpora, this proposal may open up a lot of grey areas that may be non-trivial to resolve. Agreed, and some companies will get you to sign up for accounting and service problem notifications and then pump advertising down the channel in such volume that the purpose for which you signed up seems utterly forgotten. British Telecom sets a bad example here: they even behave like a spammer inasmuch as they regularly vary their promotions text to dodge spam filters. I'd be worried that if word gets around that SA is developing rules that give signed-up bulk mail a free ride then a lot more companies will do the same. This is a misunderstanding. I am largely against whitelisting or negative score rules. I merely intend to increase the variety of legitimate mail in the nightly ham corpus so our spam-hostile rules can be better tested for safety. This will be interesting especially with non-English ham. Warren
Re: Need Volunteers for Ham Trap
On 1/18/11 12:52 AM, Warren Togami Jr. wtog...@gmail.com wrote: I am seeking volunteers to help me build and administrate a ham trap. The idea is to subscribe a list of unique e-mail addresses to various retailers, airlines, government and other legitimate bulk mail senders. The possible fly in the ointment I see is that you wouldn't necessarily have access to some sorts of transactional emails-- airline flight reminders and things of that nature. Would that be something where you'd be interested in getting mail cc:ed to a hamtrap address? For example, I use tagged email addresses for different airlines, and it would be trivial for me to have my server relay those messages to a hamtrap address as well as delivering to my personal email if that sort of thing would be useful. -- Dave Pooser Cat-Herder-in-Chief Pooserville.com
Re: Need Volunteers for Ham Trap
On 01/18/2011 03:25 PM, Dave Pooser wrote: On 1/18/11 12:52 AM, Warren Togami Jr. wtog...@gmail.com wrote: I am seeking volunteers to help me build and administrate a ham trap. The idea is to subscribe a list of unique e-mail addresses to various retailers, airlines, government and other legitimate bulk mail senders. The possible fly in the ointment I see is that you wouldn't necessarily have access to some sorts of transactional emails-- airline flight reminders and things of that nature. Would that be something where you'd be interested in getting mail cc:ed to a hamtrap address? For example, I use tagged email addresses for different airlines, and it would be trivial for me to have my server relay those messages to a hamtrap address as well as delivering to my personal email if that sort of thing would be useful. You are correct that this isn't transactional mail. It is, however, low-effort automatic collection of a subset of the ham that real users receive, much of which is entirely missing from the nightly corpus. https://fedorahosted.org/auto-mass-check/ As for the ham you suggest, I highly suggest running your own nightly masscheck and uploading logs. This avoids privacy problems and allows you to check/correct quality issues in your own corpus. Warren