Serious Proposal to add AccuTechnology(tm) to SpamAssassin (SA)
==================================================
This is my first post to this SA dev list, and I apologize if it is abrupt and lengthy. I am too busy to lurk here for a while to build a reputation first.

This proposal is directed to the Apache Software Foundation (ASF) SpamAssassin open source project. I am not sure whether starting a thread on the SA dev list is the appropriate way to initiate this sort of proposal. Should I instead be communicating first with the Project Management Committee (at pmc /at/ spamassassin.apache.org)?

I would like to propose integrating AccuTechnology(tm) with the SpamAssassin (SA) open source distribution. I imagine the integration would be somewhat similar to the past integration with Razor and other bulk correlation databases. I am proposing that I be tasked with most of the work (in areas where I am knowledgeable), and I hope to get advice along the way from experienced SA developers (in areas where I am not).

My name is Shelby Moore, and I am the inventor of AccuTechnology(tm), a new statistical method for anti-spam, summarized non-technically here:

http://AccuSpam.com/accuspam.php

More about me here:

http://AccuSpam.com/about.php

Let us be clear that AccuTechnology is *fundamentally* different from other bulk correlating anti-spam (e.g. Razor, DCC, Commtouch, etc.) in a way that enables it to detect much more spam with a much lower false positive rate. If you have not read our patent application, think of it as "fully automated (no BLOC employees, no manual training) BrightMail/Commtouch with similarities+differences to Chung-Kwei, Support Vector Machines, multi-user Bayesian".

To address possible misunderstandings from the above web page, let me make a few critical points:

==========================

a) BENEFIT: SpamAssassin is already an excellent product, but all products have some (even if few) weaknesses.
My goal with this proposal is to make SpamAssassin leap to "order-of-magnitude" better performance than other Bayesian filters, while maintaining and amplifying SA's ability to excel WITHOUT manual training:

http://sam.holden.id.au/writings/spam2/
(shows that SA is similar to other Bayesian filters when manual training is used)

Thus, in essence, my goal is to make SpamAssassin even more attractive to enterprises out-of-the-box than it is now:

http://www.nwfusion.com/reviews/2004/122004spamcharts.html
(Note that NoSpamToday has/had an SA core and did not compare well in this major and ongoing review)

Note that Bayesian has a fundamental limit on performance due to its inherent statistical power, and AccuTechnology(tm) in theory breaks free of this limit:

http://crm114.sourceforge.net/Plateau_Paper.pdf
(Yerazunis, W., 2004, The Spam-Filtering Accuracy Plateau at 99.9% Accuracy and How to Get Past It. MIT Spam Conference. Cambridge, MA)

b) MARKETING: Bayesian and AccuTechnology(tm) will complement each other very nicely to make a much better anti-spam. One is strong where the other is weak (more on that below). Current AccuSpam.com marketing talks about the weaknesses of Bayesian compared to AccuTechnology, but it does not yet talk about the strengths of the two working together, because we do not yet have a project that does so to promote. One purpose of this proposal is to enable us to promote SpamAssassin (i.e. Bayesian) in unison with AccuTechnology. Please do not react negatively to marketing. Let's focus on doing good work on anti-spam together (i.e. the "more tests is better" philosophy of SA).

c) INTEGRATION: afaik SpamAssassin's AWL (auto-whitelist) fits very nicely with a requirement of AccuTechnology(tm) in its no-manual-training mode. See the proposed integration API below.

d) EVALUATION: AccuTechnology is not AccuSpam. AccuSpam uses some aspects of AccuTechnology but adds other things which are unnecessary for an SA integration.
The current implementation of AccuSpam.com (an alpha release) is not always a good demonstration of the core AccuTechnology(tm), for reasons I can detail in later discussion. For one thing, AccuTechnology does NOT require an auto-response and does NOT require a Daily Summary (those are features of AccuSpam, not of AccuTechnology). So don't jump to conclusions by joining AccuSpam and looking for problems.

The bulk correlating nature of AccuTechnology(tm) means that we need a public alpha release. It is by no means an RC or a commercial product yet. Thus this proposal is FORWARD LOOKING towards this summer, by which time we expect to be a commercial product. We would like to start working now on integration with SA. According to industry predictions I've seen, anti-spam adoption by ISPs is heading towards saturation (or accelerating), and 2005-6 is a critical period for major players to emerge and the rest to fade away. So we do, plan, and discuss things from a forward-looking stance to optimize timing.

Also, AccuSpam went through several experiments in the past 2 years before settling on AccuTechnology as it is today, beginning around December 2004. So data (e.g. from google) about older AccuSpam versions (pre-alpha experiments) is not relevant.

e) PERFORMANCE: From a sampling of 230 active AccuSpam users, the bottom line is that some (myself included) are getting 99.5+% spam deletion with unmeasurable (below our current sample size frequency) false positives. Others are getting anywhere between 80 - 99% (weighted avg is about 95%, but in flux as we fine-tune false positive fixes), because with only 230 users our global sample of spam is not very complete (significant). Compare the magnitude of 230 with the 10000s or more users of Razor or DCC, and it becomes clear that AccuSpam's performance with only 230 is exceptional! AccuTechnology needs to see a lot of spam per day before it really kicks into high gear.
We are still working on measuring and correcting false positive issues which may exist (e.g. we fixed a critical bug yesterday where our stopwords array was not populated due to a missed assignment = operator). The bottom line is that as AccuTechnology usership increases, expect the spam deletion rate to head north of 99% for most recipients. This has to do with the statistical power of the algorithm as compared to Bayesian. I think this would be a very exciting development for SA, especially with the integration of AWL (heuristics), Bayesian, and AccuTechnology. As you will read below, AccuSpam measures something different than Bayesian, and thus the two may supplement each other quite effectively. As discussion continues, we can detail more on performance.

f) PATENT: Our goal with the proposal is to make AccuTechnology(tm) FREE for small organizations (no profit to be made there anyway), and extremely low cost (as in dimes per email account) for large organizations. Our goal with our soon-to-be-patented AccuTechnology is simply to earn a decent ROI so that we can pour more investment back into it, finance our overhead to provide the centralized database, and thus optimize the spread of the algorithm and the improvement in anti-spam.

As proposed, only the API code for AccuTechnology will be integrated into SpamAssassin. The soon-to-be-patented AccuTechnology will reside on our server and not be part of the SpamAssassin distribution. Essentially, afaik our proposed integration model is not much different from CloudMark (Razor), which SA already integrates with, but with much greater anticipated benefit to SA performance! We propose below how we think this can be accomplished within the existing Apache license (with no changes to it).

Our intention is to license AccuTechnology widely with ridiculously insignificant royalties (compared to organization size). See below for more details. We are not trying to use a patent to slow adoption or to injure anyone.
We are against trivial software patents. We are in support of complex system patents, which require huge investment to develop and just happen to use software+hardware in one embodiment. We strongly believe in open source, and I have made some contributions (some accepted and others rejected). We would like to open source the entire implementation once we get some momentum (and get an issued patent).

==========================

Discussion of Relative Strengths of Bayesian and AccuTechnology
================================================

All anti-spam is based on correlating patterns seen before:

a) Heuristics characterize past patterns in a rule.
b) Single-user Bayesian correlates past patterns in email to the same recipient.
c) Multi-user Bayesian correlates past patterns in email to the same and many recipients.
d) Bulk correlators do the same as #c but do not have to be trained on the classification of past patterns; however, the patterns they correlate are static.
e) AccuTechnology does #d and automatically (dynamically) finds the best patterns.

Thus Bayesian can supplement AccuTechnology by classifying non-bulk spam (or spam outside the multi-user statistical sample) that a recipient sees often, and AccuTechnology supplements Bayesian by detecting new spam patterns automatically (without training) in real-time.

Discussion of Proposed Integration APIs and Licensing
========================================

Only the API code for AccuTechnology will be integrated into SpamAssassin. The soon-to-be-patented AccuTechnology will reside on our server and not be part of the SpamAssassin distribution.
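As a rough illustration of the "no registration" free mode described under Licensing Issues item a) in this section, here is a small Python sketch. It is not the real service or its client: FakeAccuTechnology is a hypothetical in-memory stand-in, and the SHA-256 checksum, the license string format, N = 5, the fixed placeholder score of 0.9, and the classify() helper are all assumptions made only for this example. The two method names mirror the two proposed network calls.

```python
import hashlib

UNKNOWN = 0.5  # per the proposal, 0.5 means "unknown classification"


class FakeAccuTechnology:
    """Hypothetical in-memory stand-in for the proposed network service."""

    def __init__(self, n=5, placeholder_score=0.9):
        self.n = n                    # N from licensing item a); 5 is an assumed value
        self.calls = 0                # number of distinct messages seen so far
        self.licenses = {}            # message checksum -> (license string, is_live)
        self.placeholder_score = placeholder_score  # stands in for the real statistics

    @staticmethod
    def _checksum(msg):
        # Assumption: any stable digest would do; SHA-256 is used here.
        return hashlib.sha256(msg.encode("utf-8")).hexdigest()

    def request_free_message_license_instance(self, msg):
        digest = self._checksum(msg)
        # Repeated requests for the same message reuse the same license,
        # so they cannot advance the every-Nth counter (assumed reading of
        # the checksum rule in item a).
        if digest not in self.licenses:
            self.calls += 1
            is_live = self.calls % self.n == 0
            self.licenses[digest] = ("free-%d" % self.calls, is_live)
        return self.licenses[digest][0]

    def message_spam_probability(self, msg, license_str, is_awl, unique_recipient_id):
        # is_awl and unique_recipient_id are accepted to match the proposed
        # signature but are not used by this toy stand-in.
        entry = self.licenses.get(self._checksum(msg))
        if entry is None or entry[0] != license_str or not entry[1]:
            return UNKNOWN
        return self.placeholder_score


def classify(service, msg, is_awl, recipient_id):
    """Hypothetical SA-side glue: request a license, then ask for a score."""
    license_str = service.request_free_message_license_instance(msg)
    return service.message_spam_probability(msg, license_str, is_awl, recipient_id)


if __name__ == "__main__":
    service = FakeAccuTechnology(n=5)
    for i in range(1, 11):
        print(i, classify(service, "message %d" % i, False, "user-1"))
```

With N = 5 in this stand-in, only the 5th and 10th distinct messages come back with a score other than 0.5, and resubmitting an earlier message does not change that.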
Network calls to AccuTechnology:

class AccuTechnology
{
    // Returns a license string which can be used for one message
    string RequestFreeMessageLicenseInstance( string msg )

    // Returns the spam probability of the input message; 0.5 means "unknown classification"
    // The input license string may be from RequestFreeMessageLicenseInstance() or
    // it may be a free, free-trial, or purchased license from AccuSpam
    // The returned spam probability is based on a confidence interval of the AccuTechnology sample.
    float MessageSpamProbability( string msg, string license, boolean is_awl, string unique_recipient_id )
}

Licensing Issues:

a) AccuTechnology::MessageSpamProbability() will return != 0.5 only for every Nth call to AccuTechnology::RequestFreeMessageLicenseInstance(), where N will probably be 5 or 10. It will checksum the message to make sure multiple calls to AccuTechnology::RequestFreeMessageLicenseInstance() cannot be used to subvert this limit. The other intervening calls will not send any message data over the network. Thus it will boost the performance of SA, but will not return useful information for every message in this "no registration" method of the free mode.

b) Organizations who like the boost they get from #a may wish to register at AccuSpam.com to obtain a free, free-trial, or purchased license string, which enables every message to be evaluated by AccuTechnology::MessageSpamProbability(). Our current rough guideline (intention) for such licensing is (subject to change, and we reserve all rights):

* Registration for perpetual free licenses for organizations under 100 email accounts.
* Registration for 90 day free-trial licenses for organizations with 100 or more email accounts.
* Registration for purchased licenses for organizations with 100 or more email accounts:
  -- $5 * log( 2 ) / log( # email accounts / 50 ) per email account per year
  -- thus: 100 = $5 ea, 200 = $3 ea, 500 = $1.80 ea, 1000 = $1.38 ea, 10000 = $0.78 ea, 100000 = $0.54 ea
  -- even lower pricing for 10000+ licensees who host the centralized database
* Discounts or free for educational organizations

Network/Privacy Issues:

a) AccuTechnology::MessageSpamProbability() will send the From:, Sender:, Mailing-List:, and Subject: headers and the body of the message over the network to the centralized database. Other headers may be added in future, but currently this is not foreseen. The database does not store these messages; it only stores statistics on this data, and it does not normally store statistics correlated to unique_recipient_id. The only exception is that it does store statistics (not messages) correlated to unique_recipient_id for those few data where those statistics are exceptionally different from the global statistics. In short, there is no way to use the centralized database to recompose meaningful messages or to correlate any messages to unique_recipient_id. In the near future, we can apply some optimizations for larger organizations (with a customized SpamAssassin) such that none of the message is sent over the network.

All feedback and discussion is welcome and appreciated. I am sure we can improve our proposal with your input.

LEGAL: AccuSpam, 3Dize, Inc., and I reserve all rights. Only an agreed contribution to an Apache project by us will alter these rights with respect to any contribution. The above is discussion only, and not a contribution under the Apache license.

Kind Regards,
Shelby Moore III

"Information knows no master which like a river can never be permanently impeded from reaching its destination and thus source."

CEO 3Dize, Inc. (coolpage.com)
CEO DownloadFAST.com, Inc.
founder and main programmer of AccuSpam.com* (AntiViotic.com)
main programmer of Cool Page* (1998-), Art-O-Matic* (1996-8), WordUp* (1986-90), TurboJet (1988)
contributing programmer to DownloadFAST.com* (2001-), Corel Painter* (1993-5 at Fractal Design Corp), Corel ArtDabbler, EOS PhotoModeler (1996), FONTZ! (1988)
founder and main programmer of coming soon Paytector(tm).com
inventor of coming soon FlexCanvas(tm)

* denotes major involvement in massive multi-year R&D projects with millions of characters (1000s of pages) of code