On Tuesday, December 21, 2004, 4:49:33 AM, Markus wrote:

MG> First of all spam is anything
MG> coming from nonexistent, or forged senders
MG> having "hidden" content
MG> But what you're asking for is the difference between our
MG> human brain and stupid computers (Pete, your comment please ;-)

Well... I'm having fun lurking and I don't want to spoil that. I'm anxious to learn what folks are thinking about all of this (without my nudging).

The current implementation of Sniffer is a kind of broad spectrum hybrid learning system. We use statistical models to try to keep the core rulebase targeting what our users _seem_ to want filtered, then we customize individual rulebases to match specific preferences. The learning model isn't perfect, but it has shown that by and large there is strong agreement among most folks about what should be filtered - even if that definition cannot be clearly and consistently stated. (Note I did not say "what is spam" because that is getting to be more precise and more contentious these days.)

What I find (and it really stands out when working with Matt) is that the definition indicated by the standing rules in our core rulebase is a mixed bag of features and that the definition is highly fluid around the edges. For example, in large part Matt's rules would indicate traffic from chtah is "not spam" but even he admits it's not acceptable to make that definition hard (not ok to white-list chtah).

One more liberal definition of ham holds that if the recipient has a first party relationship with the sender then any content from that sender should not be filtered... Clearly, from the volume of direct advertising that is submitted to us as spam (even as recurring spam problems), this definition does not hold for most of our users.

This "edge definition problem" was predicted and so far our model is doing a reasonably good job of dealing with it - though improvements are clearly needed and are on their way (albeit slowly).
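To make the "core rulebase plus customized individual rulebases" idea a little more concrete, here is a minimal Python sketch. It is not how Sniffer is actually implemented; the `Rule` and `UserRulebase` names, the substring patterns, the weights, and the threshold are all assumptions made up for illustration. The point it tries to show is that a per-user preference can soften a core rule without turning it into a hard white-list.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class Rule:
    """One hypothetical core rule: a substring pattern and a weight."""
    name: str
    pattern: str   # plain lowercase substring match (assumption, for brevity)
    weight: float  # positive values push a message toward "filter"


# Hypothetical shared core rulebase; names and weights are illustrative only.
CORE_RULES = [
    Rule("hidden-content", "display:none", 1.0),
    Rule("bulk-mailer", "chtah.com", 0.6),
]


@dataclass
class UserRulebase:
    """Core rules plus per-user weight overrides keyed by rule name."""
    overrides: Dict[str, float] = field(default_factory=dict)
    threshold: float = 0.5

    def score(self, message: str) -> float:
        text = message.lower()
        total = 0.0
        for rule in CORE_RULES:
            if rule.pattern in text:
                # A user override replaces the core weight for that rule only.
                total += self.overrides.get(rule.name, rule.weight)
        return total

    def should_filter(self, message: str) -> bool:
        return self.score(message) >= self.threshold


default_user = UserRulebase()
lenient_user = UserRulebase(overrides={"bulk-mailer": 0.0})

newsletter = "Newsletter sent via chtah.com"
print(default_user.should_filter(newsletter))
print(lenient_user.should_filter(newsletter))
print(lenient_user.should_filter(newsletter + " <span style='display:none'>x</span>"))
```

In this sketch the lenient user's override zeroes out the bulk-mailer rule, so ordinary traffic from that sender passes, but the same sender is still filtered when another rule fires. That is the difference between a soft preference and a white-list.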
In the meantime, end-user specific Bayesian classification can often solve the edge problem -- thus reinforcing that the fluidity at the edge is largely due to differences in the filtering preferences of the end users and the variability thereof.

Add to that the problem of data collection and the problem becomes not only difficult to solve, but difficult to measure. Imagine piloting a supersonic fighter jet through a narrow winding canyon with your eyes shut and you've just about got the picture.

As for the stupidity of machines... I personally believe that strong intelligence can be built artificially (and in fact I do that for fun and profit)... The big challenge with using AI for spam is the same as for many AI systems where people's expectations are concerned: the AI cannot and does not have a human frame of reference, and so even if it did match or exceed the innate intelligence of a human counterpart, it would not be in a position to predict or model human behaviors precisely.

Said another way (partly tongue in cheek) - since computers don't have sex, they don't grok porn and (ahem) organ enhancement spam. Without a social frame of reference they are reduced to guessing at otherwise meaningless patterns. You or I could do no better in that world.

So, what we do with the design of Sniffer is to build a highly integrated hybrid with both human and machine components. Each gives the other strong leverage where it's needed. The machines remember better than we do, find and learn patterns well, and manage large datasets without too much effort. The humans understand the social contexts, predict and decode the strategies that are used by spammers, and interpret the needs and desires of our customers.

I think I might be rambling... Were these the kinds of comments you were looking for?

_M

---
This E-mail came from the Declude.JunkMail mailing list.