Serious Proposal to add AccuTechnology(tm) to SpamAssassin (SA)
==================================================

This is my first post to the SA dev list, and I apologize if it is abrupt and
lengthy.  I am too busy to lurk here for a while to build a reputation first.
This proposal is directed to the Apache Software Foundation (ASF) SpamAssassin
open source project.

I am not sure whether starting a thread on the SA dev list is the appropriate
way to initiate this sort of proposal.  Should I instead be communicating first
with the Project Management Committee (at pmc /at/ spamassassin.apache.org)?

I would like to propose integrating AccuTechnology(tm) with the SpamAssassin
(SA) open source distribution.

The integration would, I imagine, be somewhat similar to past integration with
Razor and other bulk correlation databases.  I propose to do most of the work
myself (in the areas where I am knowledgeable), and I hope to get advice along
the way from experienced SA developers (in the areas where I am not
knowledgeable).

My name is Shelby Moore, and I am the inventor of AccuTechnology(tm), a new 
statistical method for anti-spam, as summarized non-technically here:

http://AccuSpam.com/accuspam.php

More about me here:

http://AccuSpam.com/about.php

Let us be clear that AccuTechnology is *fundamentally* different from other
bulk correlating anti-spam systems (e.g. Razor, DCC, Commtouch, etc.) in a way
that enables it to detect much more spam with a much lower false positive rate.
If you have not read our patent application, think of it as a "fully automated
(no BLOC employees, no manual training) BrightMail/Commtouch, with similarities
to and differences from Chung-Kwei, Support Vector Machines, and multi-user
Bayesian".

To address possible misunderstandings from the above web page, let me make a
few critical points:

==========================

a) BENEFIT: SpamAssassin is already an excellent product, but all products have
some (even if few) weaknesses.  My goal with this proposal is to make
SpamAssassin leap to "order-of-magnitude" better performance than other
Bayesian filters, while maintaining and amplifying SA's ability to excel
WITHOUT manual training:

http://sam.holden.id.au/writings/spam2/
(shows that SA is similar to other Bayesian filters, when manual training is 
used)

Thus in essence my goal is to make SpamAssassin even more attractive to
enterprises out-of-the-box than it is now:

http://www.nwfusion.com/reviews/2004/122004spamcharts.html
(Note NoSpamToday has/had SA core and did not compare well in this major and 
ongoing review)

Note that Bayesian has a fundamental limit on performance due to its inherent
statistical power, and AccuTechnology(tm) in theory breaks free of this limit:

http://crm114.sourceforge.net/Plateau_Paper.pdf
(Yerazunis, W., 2004, The Spam-Filtering Accuracy Plateau at 99.9% Accuracy and 
How to Get Past It. MIT Spam Conference. Cambridge, MA)

b) MARKETING: Bayesian and AccuTechnology(tm) will complement each other very
nicely to make a much better anti-spam filter.  One is strong where the other
is weak (more on that below).  Current AccuSpam.com marketing talks about the
weaknesses of Bayesian compared to AccuTechnology, but it does not yet talk
about the strengths of the two working together, because we do not yet have a
project (which does that) to promote.  One purpose of this proposal is to
enable us to promote SpamAssassin (i.e. Bayesian) in unison with
AccuTechnology.  Please do not react negatively to marketing.  Let's focus on
doing good work on anti-spam together (i.e. the "more tests is better"
philosophy of SA).

c) INTEGRATION: afaik SpamAssassin's AWL (auto-whitelist) fits very nicely with
a requirement of AccuTechnology(tm) in its no-manual-training mode.  See the
proposed integration API below.

d) EVALUATION: AccuTechnology is not AccuSpam.  AccuSpam uses some aspects of
AccuTechnology but adds other things which are unnecessary for an SA
integration.  The current implementation of AccuSpam.com (an alpha release) is
not always a good demonstration of the core AccuTechnology(tm), for reasons I
can detail in later discussion.  For one thing, AccuTechnology does NOT require
an auto-response and does NOT require a Daily Summary (those are features of
AccuSpam, not of AccuTechnology).  So don't jump to conclusions by joining
AccuSpam and looking for problems.  Because of the bulk correlating nature of
AccuTechnology(tm), we need a public alpha release.  It is by no means an RC or
commercial product yet.  Thus this proposal is FORWARD LOOKING towards this
summer, by which time we expect to be a commercial product.  We would like to
get working now on integration with SA.  According to industry predictions I've
seen, anti-spam adoption by ISPs is heading towards saturation (or
accelerating), and 2005-6 is a critical period for major players to emerge and
the rest to fade away.  So we do/plan/discuss things with a forward-looking
stance to optimize timing.

Also, AccuSpam went through several experiments in the past 2 years before
settling on AccuTechnology as it is today, beginning around December 2004.  So
data (e.g. from Google) about older AccuSpam versions (pre-alpha experiments)
is not relevant.

e) PERFORMANCE: From a sampling of 230 active AccuSpam users, the bottom line
is that some (myself included) are getting 99.5+% spam deletion with
unmeasurable (below what our current sample size can detect) false positives.
Others are getting anywhere between 80% and 99% (the weighted average is about
95%, but it is in flux as we fine-tune false positive fixes), and this is
because with only 230 users, our global sample of spam is not very complete
(significant).  Compare the magnitude of 230 with the 10000s or more users of
Razor or DCC and it becomes clear that AccuSpam's performance with only 230 is
exceptional!  AccuTechnology needs to see a lot of spam per day before it
really kicks into high gear.  We are still working on measuring and correcting
any false positive issues which may exist (e.g. we fixed a critical bug
yesterday where our stopwords array was not populated due to a missing
assignment (=) operator).  Bottom line is that as AccuTechnology usership
increases, expect the spam deletion rate to head north of 99% for most
recipients.  This has to do with the statistical power of the algorithm as
compared to Bayesian.  I think this would be a very exciting development for
SA, especially with the integration of AWL (heuristics), Bayesian, and
AccuTechnology.  As you will read below, AccuSpam measures something different
than Bayesian, and thus the two may supplement each other quite effectively.
As discussion continues, we can detail more on performance.

f) PATENT: Our goal with the proposal is to make AccuTechnology(tm) FREE for
small organizations (no profit to be made there anyway), and extremely low
cost (as in dimes per email account) for large organizations.  Our goal with
our soon-to-be-patented AccuTechnology is simply to earn a decent ROI so that
we can pour more investment back into it, finance our overhead to provide the
centralized database, and thus optimize the spread of the algorithm and thus
the improvement in anti-spam.  As proposed, only the API code for
AccuTechnology will be integrated into SpamAssassin.  The soon-to-be-patented
AccuTechnology will reside on our server and not be part of the SpamAssassin
distribution.

Essentially, afaik our proposed integration model is not much different than
CloudMark (Razor), which SA already integrates with, but with much greater
anticipated benefit to SA performance!  We propose below how we think this can
be accomplished within the existing Apache license (with no changes to it).
Our intention is to widely license AccuTechnology with ridiculously
insignificant royalties (compared to organization size).  See below for more
details.

We are not trying to use a patent to slow adoption or to injure anyone.  We
are against trivial software patents.  We are in support of complex system
patents, which require huge investment to develop, and just happen to use
software+hardware in one embodiment.  We strongly believe in open source, and I
have made some contributions (some accepted and others rejected).  We would
like to open source the entire implementation once we get some momentum (and
get an issued patent).
==========================


Discussion of Relative Strengths of Bayesian and AccuTechnology
================================================

All anti-spam methods are based on correlating patterns seen before:

a) Heuristics characterize past patterns in a rule.
b) Single-user Bayesian correlates past patterns in email to the same
recipient.
c) Multi-user Bayesian correlates past patterns in email to the same and many
recipients.
d) Bulk correlators do the same as #c but do not have to be trained on the
classification of past patterns; however, they are static in which patterns
they correlate.
e) AccuTechnology does #d and automatically (dynamically) finds the best
patterns.

Thus Bayesian can supplement AccuTechnology by classifying non-bulk spam (or
spam outside the multi-user statistical sample) that a recipient sees often,
and AccuTechnology supplements Bayesian by detecting new spam patterns
automatically (without training) in real-time.


Discussion of Proposed Integration APIs and Licensing
========================================

Only the API code for AccuTechnology will be integrated into SpamAssassin.  The
soon-to-be-patented AccuTechnology will reside on our server and not be part of
the SpamAssassin distribution.

Network calls to AccuTechnology:

class AccuTechnology
{
   // Returns a license string which can be used for one message
   string RequestFreeMessageLicenseInstance( string msg )

   // Returns the spam probability of the input message; 0.5 means
   // "unknown classification".
   // The input license string may be from RequestFreeMessageLicenseInstance(),
   // or it may be a free, free-trial, or purchased license from AccuSpam.
   // The returned spam probability is based on a confidence interval of the
   // AccuTechnology sample.
   float MessageSpamProbability( string msg, string license, boolean is_awl,
                                 string unique_recipient_id )
}

Licensing Issues:

a) AccuTechnology::MessageSpamProbability() will return != 0.5 only for every
Nth call to AccuTechnology::RequestFreeMessageLicenseInstance(), where N will
probably be 5 or 10.  It will checksum the message to make sure that multiple
calls to AccuTechnology::RequestFreeMessageLicenseInstance() cannot be used to
subvert this limit.  The other intervening calls will not send any message data
over the network.  Thus it will boost the performance of SA, but will not
return useful information for every message in this "no registration" method
of the free mode.

b) Thus organizations who like the boost they get from #a may wish to register
at AccuSpam.com to obtain a free, free-trial, or purchased license string,
which enables every message to be evaluated by
AccuTechnology::MessageSpamProbability().
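
The behaviour described in #a can be sketched roughly as follows.  This is only
an illustration under stated assumptions, not the AccuTechnology server logic:
the class FreeModeGate, its method names, the license string format, the use of
SHA-256 as the checksum, and the classify callback are all hypothetical, and
N = 5 is just one of the values mentioned above.  Everything runs in-process
here; no network is involved.

```python
import hashlib

UNKNOWN = 0.5  # per the proposal, 0.5 means "unknown classification"


class FreeModeGate:
    """Hypothetical gate for the "no registration" free mode (illustration only)."""

    def __init__(self, n=5):
        self.n = n          # every Nth distinct license request is "live"
        self.requests = 0   # number of distinct messages seen so far
        self.issued = {}    # message checksum -> (license string, is_live)
        self.used = set()   # checksums whose one-message license was consumed

    @staticmethod
    def _checksum(msg):
        return hashlib.sha256(msg.encode("utf-8")).hexdigest()

    def request_free_message_license_instance(self, msg):
        digest = self._checksum(msg)
        if digest not in self.issued:
            # Assumption: a repeated request for the same message reuses the
            # first result, so re-requesting cannot advance the counter.
            self.requests += 1
            is_live = self.requests % self.n == 0
            self.issued[digest] = (f"free-{self.requests}", is_live)
        return self.issued[digest][0]

    def message_spam_probability(self, msg, license_str, classify):
        digest = self._checksum(msg)
        entry = self.issued.get(digest)
        if entry is None or entry[0] != license_str or not entry[1]:
            return UNKNOWN  # not a live license for this message
        if digest in self.used:
            return UNKNOWN  # the license is good for one message only
        self.used.add(digest)
        return classify(msg)  # stand-in for the real classification
```

With n=5, ten distinct messages would yield a real probability only for the
fifth and tenth; every other call returns 0.5.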

Our current rough guideline (intention) for such licensing is (subject to 
change and we reserve all rights):

        * Registration for perpetual free licenses for organizations under 100
          email accounts.
        * Registration for 90 day free-trial licenses for organizations with
          100 or more email accounts.
        * Registration for purchased licenses for organizations with 100 or
          more email accounts:
                -- $5 * log( 2 ) / log( # email accounts / 50 ) per email
                   account per year
                -- thus: 100 = $5 ea, 200 = $3 ea, 500 = $1.80 ea,
                   1000 = $1.38 ea, 10000 = $0.78 ea, 100000 = $0.54 ea
                -- even lower pricing for 10000+ licensees who host the
                   centralized database
        * Discounts or free for educational organizations
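
To make the pricing formula concrete, here is a minimal Python sketch that
evaluates it exactly as written (the base of the logarithm cancels in the
ratio, so natural log is used).  The function name price_per_account is
hypothetical.  Evaluated literally, the formula gives $5.00 at 100 accounts,
about $2.50 at 200, and about $1.51 at 500, somewhat below the rounded examples
in the list, so both should be read as a rough guideline only.

```python
import math


def price_per_account(accounts):
    """Yearly price per email account in dollars, per the formula as written."""
    if accounts < 100:
        # Organizations under 100 accounts get a perpetual free license.
        return 0.0
    return 5 * math.log(2) / math.log(accounts / 50)


for n in (100, 200, 500, 1000, 10000, 100000):
    print(f"{n:>6} accounts: ${price_per_account(n):.2f} each per year")
```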


Network/Privacy Issues:

a) AccuTechnology::MessageSpamProbability() will send the From:, Sender:,
Mailing-List:, and Subject: headers and the body of the message over the
network to the centralized database.  Other headers may be added in future, but
currently this is not foreseen.  The database does not store these messages; it
only stores statistics on this data, and it does not normally store statistics
correlated to unique_recipient_id.  The only exception is that it does store
statistics (not messages) correlated to unique_recipient_id for those few data
where those statistics are exceptionally different from the global statistics.
In short, there is no way to use the centralized database to recompose
meaningful messages or to correlate any messages to unique_recipient_id.

In the near future, we can apply some optimizations to larger organizations 
(with a customized SpamAssassin) such that none of the message is sent over the 
network.
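
As a rough illustration of which parts of a message would leave the local
machine under #a, the following Python sketch extracts just those fields with
the standard library email package.  The function name fields_to_send and the
dictionary layout are hypothetical, and the real API code may extract or encode
the fields differently.

```python
from email import message_from_string

# The headers the proposal says would be sent, alongside the body.
SENT_HEADERS = ("From", "Sender", "Mailing-List", "Subject")


def fields_to_send(raw_message):
    """Return only the headers named in the proposal, plus the message body."""
    msg = message_from_string(raw_message)
    payload = {}
    for name in SENT_HEADERS:
        value = msg.get(name)
        if value is not None:
            payload[name] = value
    if msg.is_multipart():
        # Simplification: keep only the text/plain parts of a multipart message.
        parts = [p.get_payload() for p in msg.walk()
                 if p.get_content_type() == "text/plain"]
        payload["body"] = "\n".join(parts)
    else:
        payload["body"] = msg.get_payload()
    return payload
```

Headers such as To: or Received: are simply never copied into the result, which
matches the intent that only the listed fields are transmitted.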

All feedback and discussion is welcome and appreciated.  I am sure we can 
improve our proposal with your input.


LEGAL: AccuSpam, 3Dize, Inc., and I reserve all rights.  Only an agreed 
contribution to an Apache project by us will alter these rights with respect to 
any contribution.  The above is discussion only, and not a contribution under 
the Apache license.


Kind Regards,
Shelby Moore III

"Information knows no master which like a river can never be permanently
impeded from reaching its destination and thus source."

CEO 3Dize, Inc. (coolpage.com)
CEO DownloadFAST.com, Inc.
founder and main programmer of AccuSpam.com* (AntiViotic.com)
main programmer of Cool Page* (1998-), Art-O-Matic* (1996-8), WordUp* 
(1986-90), TurboJet (1988)
contributing programmer to DownloadFAST.com* (2001-), Corel Painter* (1993-5 at 
Fractal Design Corp), Corel ArtDabbler, EOS PhotoModeler (1996), FONTZ! (1988)
founder and main programmer of coming soon Paytector(tm).com
inventor of coming soon FlexCanvas(tm)

* denotes major involvement in massive multi-year R&D projects with millions of 
characters (1000s of pages) of code
