Pete and other Sniffer Customers,

I've been having a lot of issues with false positives in the General category, and I'm in search of a better way to handle them, having made little progress so far without committing a large amount of time to the problem.

The General category seems to consist primarily of E-mail that was reported as spam by Sniffer's customers but didn't hit one of Sniffer's spam traps.  Since I only monitor a certain range of E-mail that just barely fails my system, I often find that the messages tagged with Sniffer General in this range are what I consider to be false positives; they originate from bulk-mail providers such as CheetahMail, DartMail, etc., or come directly from first parties such as Amazon, Target, eDiets, etc.

Recently I took on the large task of identifying the bulk-mail providers by both IP block and reverse DNS entries so that I could segregate this content from everything else, and also exempt it from other filters in my Declude setup that produce somewhat random results but weren't intended to target E-mail of this variety (such as BADHEADERS, SPAMHEADERS, GIBBERISH, BASE64SUB and others).  I then assigned each of these providers a base score on one of four levels according to the trustworthiness of the provider; mail from the least trustworthy is automatically held or deleted on my system.  This gives me a predictable base score, on top of which the scores from Sniffer, SpamCop and SURBL are the primary deciding factor in whether an E-mail is held.

Unfortunately, this exposed a large number of false positives, primarily in Sniffer-General but also in Sniffer-Experimental, that were sneaking in under the limit or were otherwise going unnoticed when these E-mails were not being segregated.  It is my quest to fix these issues, as they account for over 3/4 of all of my false positives.  Marcus's own statistics suggest only about an 80% accuracy for this group of rules.
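To illustrate, here is a rough Python sketch of the kind of segregation and base scoring I mean.  The provider names, tiers, IP blocks and scores below are invented placeholders, not my actual data or my Declude configuration:

import ipaddress
from typing import Optional

# Hypothetical tier table: bulk-mail providers identified by reverse-DNS
# suffix or IP block, each mapped to one of four trust levels.
PROVIDER_TIERS = {
    "cheetahmail.com": 1,   # most trustworthy
    "dartmail.net":    2,
    "example-esp.com": 3,
    "sketchy-esp.net": 4,   # least trustworthy; held or deleted outright
}
IP_TIERS = {
    ipaddress.ip_network("192.0.2.0/24"): 2,   # illustrative block only
}
BASE_SCORES = {1: 0, 2: 3, 3: 6, 4: 10}

def base_score(rev_dns: str, ip: str) -> Optional[int]:
    """Return a predictable base score if the sender is a known bulk mailer."""
    for suffix, tier in PROVIDER_TIERS.items():
        if rev_dns.endswith(suffix):
            return BASE_SCORES[tier]
    addr = ipaddress.ip_address(ip)
    for net, tier in IP_TIERS.items():
        if addr in net:
            return BASE_SCORES[tier]
    return None   # unrecognized source; score it the normal way

With a predictable base like that, the Sniffer, SpamCop and SURBL results become the deciding margin rather than one random factor among many.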

I've narrowed down what I feel is really at issue here, so let me summarize and then discuss:

1) Sniffer customers reporting advertising-related E-mail that comes from companies with first-party relationships with the recipients (though the recipients mostly never gave direct permission to be added to the lists).

2) Overbroad rules generated by Sniffer.  This includes things such as tagging a bulk-mail provider's domain for a violation by one of their customers, generating rules from things like tracking links or image hosts, and occasionally from phrases and more broadly coded filters (such as *offers@).

3) Rules that target the same sources as other rules I have already asked to have blocked, causing repeated false positives despite my efforts to stop such things from occurring.

As far as the first item goes, this is primarily an issue of everyone having different standards for what they consider to be spam, and we are most likely to disagree about things that fall into this gray category, where first-party relationships between the sender and recipient often exist but with varying levels of abuse arising from many different circumstances.  For instance, many people really hate Orbitz, Travelocity, Expedia and Hotwire ads, but from what I can tell they are sent exclusively to their own customers.  It's the topic and the frequency that make people consider them spam, yet they all honor opt-outs as far as I can tell, and just today a customer of mine reported a very low-value Orbitz ad as a false positive.  I have asked for rules for the same source to be blocked on three separate occasions, because seemingly as fast as Pete removes them according to his rules, new ones appear.

I do maintain my own whitelist for such things, but I also make it a practice to report them to Sniffer, because I am not sure which rule might have tripped or what other issues such rules might cause if they aren't removed from my rulebase.  My whitelist is specifically targeted and doesn't always prevent future rules from causing issues on my system.  I am also hesitant to request white rules, because spammers will stuff domains into their messages to exploit such things or throw off URL parsers.  The net effect of all of this is that whitelisting is only partially successful, and it takes me considerable time to report, whitelist and monitor on a continual basis.  I'm sure I am also pissing off some other people by submitting FPs that defeat their FN reports.

I think there needs to be a change in the way that this is handled, and I have a couple of ideas.  The first would be to implore other Sniffer customers not to report E-mail that they might find objectionable but have no proof was sent to people without a first-party relationship with the company or newsletter, and no proof that the company fails to honor opt-outs.  When I get such reports from my customers, I unsubscribe them and have never had an issue doing so.  (Naturally I don't unsubscribe from spam houses.)  When an administrator with tight rules for their system, due to things like not allowing non-business content on a corporate server, comes across such things, I would ask that they blacklist them locally instead of automatically reporting them to Sniffer.

The second idea would be for Sniffer to adopt a new method of handling such submissions, where it takes multiple manual submissions for a source that originates from a legitimate bulk-mail provider, first-party advertiser or newsletter provider before the global rule is generated, while a local black rule is created immediately for the customer that submitted it.  This would raise the bar on submissions so that a single admin's tolerance for this stuff doesn't affect the entire community.  I would be happy to share my data on bulk-mail providers with Sniffer to help identify such sources, and I plan on updating it continually given the extent of the problem.
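In case it helps, here is a rough sketch of the promotion logic I have in mind for that second idea.  The threshold and every name here are hypothetical placeholders, not a proposal for specific values:

from collections import defaultdict

PROMOTION_THRESHOLD = 3   # hypothetical: distinct admins needed for a global rule

reports = defaultdict(set)        # source -> admin IDs that manually reported it
local_rules = defaultdict(set)    # admin ID -> sources blocked locally
global_rules = set()

def submit(admin_id: str, source: str, from_known_bulk_mailer: bool) -> None:
    """Handle a manual spam submission under the proposed policy."""
    if not from_known_bulk_mailer:
        global_rules.add(source)        # ordinary sources: behavior unchanged
        return
    local_rules[admin_id].add(source)   # the submitter is always honored locally
    reports[source].add(admin_id)
    if len(reports[source]) >= PROMOTION_THRESHOLD:
        global_rules.add(source)        # enough independent reports: go global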

As far as #2 (overbroad rules) goes, this is a somewhat less problematic contributor to the issue at hand, but I am hoping there could be a way to further enhance the detection of such overbroad rules by building a database of them over time so that they won't continually create issues.  This might exist already, but it still occasionally contributes to the problems that I see, so at minimum continued diligence on the matter would be appreciated.
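The vetting pass I'm imagining could be as simple as something like the sketch below; the host names are invented, and the real list would be the shared infrastructure of bulk-mail providers, tracking-link hosts and image hosts:

# Patterns matching shared infrastructure are too broad to target a single
# advertiser, so candidate rules falling inside them would be rejected or
# flagged for human review instead of being published.
SHARED_INFRASTRUCTURE = {
    "cheetahmail.com",      # bulk-mail provider domain
    "click.example.net",    # tracking-link host
    "images.example.com",   # shared image host
}

def too_broad(candidate_pattern: str) -> bool:
    """Flag a candidate rule whose pattern matches shared infrastructure."""
    return any(host in candidate_pattern for host in SHARED_INFRASTRUCTURE)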

#3 (recurring rules) could be solved through similar means as #2, or perhaps through a new strategy for qualifying rules that would hit things that have previously been deprecated.  Keeping a database of sources based on FP reports and checking new rules against it might be a good way to qualify the rules.  It has been noted in Bayesian filtering, for instance, that the false positive problem can be reduced by weighting the good words in ham disproportionately to the spam words, without substantially weakening the capture rates.  I think the same principle might apply here: a rule that was once responsible for a reported false positive should carry more clout in the system than a newly generated, unproven rule produced in large part by automation.
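As a rough illustration of that weighting idea (the halving factor and the names are my own invented placeholders):

# source -> number of prior FP reports against rules targeting it
fp_history = {"trackinghost.example.com": 2}

def qualify(source: str, base_weight: float) -> float:
    """Weight a candidate rule down in proportion to its FP history."""
    prior_fps = fp_history.get(source, 0)
    # Each prior FP report halves the weight of a new automated rule, so a
    # twice-burned source would need human review before scoring heavily again.
    return base_weight / (2 ** prior_fps)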

Sorry for the length, but I really am hoping for a way to improve this situation and reduce the workload it creates for administrators like me who seek to tightly manage their systems.

Thanks,

Matt
-- 
=====================================================
MailPure custom filters for Declude JunkMail Pro.
http://www.mailpure.com/software/
=====================================================
