http://issues.apache.org/SpamAssassin/show_bug.cgi?id=3078





------- Additional Comments From [EMAIL PROTECTED]  2007-06-14 23:23 -------
Hi all,
I will implement this Bayesian Noise Reduction (BNR) module as my Google Summer
of Code project. If you do not yet know what BNR is, please refer to Jonathan’s
original paper (http://bnr.nuclearelephant.com/). Here I present my design
and ask for your opinions. There are certainly several possible designs. My
choice is based on the fact that BNR is needed only by the Bayes learner, and my
primary goals are to keep configuration simple and the performance penalty low.

Here is my general design for your review:

1.      Where to hook the BNR filter?
BNR exists to improve Bayes performance, so I will confine it to the Bayes
module. Other modules will not see the result of BNR filtering.
Before a message is passed to the Bayes scanner, it is parsed into several
parts: the visible body (all text of text/plain parts and the visible part of
text/html, with HTML tags stripped), the invisible body (the hidden or “hard to
see” part of text/html, with HTML tags stripped), the URL list, and all headers.
The visible body is a list of the words visible to the user, in their original
order. The BNR consistency check, which checks each word against its
neighboring words, should help to find out-of-context words there. Whether BNR
helps with the invisible body is doubtful, since the invisible body usually
consists of short sentences or phrases scattered around the email. The other
two parts, URLs and headers, are not suitable for BNR filtering. I will add the
BNR filtering code after the visible body (and the invisible body?) has been
tokenized, so the Bayes learner will see only purified tokens. Note also that
in Jonathan’s original work, HTML tags are not removed before BNR. SA is
different: Bayes in SA does not deal with HTML tags.
Context pattern learning will be added to the Bayes learning process; there
will be no separate context pattern learning step. Every time Bayes updates
token counts (sa-learn or auto-learn), it will also update the context patterns,
and forgetting context patterns is tied to forgetting tokens in the same way.
Context patterns will not expire, since the database will hold at most
20*20*20=8000 rows. Every time we purge, back up, restore, or dump the Bayes
database, we do the same for the context patterns.

2.      How to perform the noise reduction?
Thanks to Jonathan’s open source release of libbnr, the core BNR algorithm is
about 80 lines of C. It should be easy to adapt to SA. The tricky part is how
to pass the context patterns and token statistics to BNR.

3.      What are the new configuration items?
I will try to keep the number of new parameters small. Here they are:
•       bnr_window_radius (default: 0.25): the window around 0.5 that BNR
considers uninteresting. The default of 0.25 means BNR ignores contexts whose
p-value is between 0.25 and 0.75.
•       bnr_token_radius (default: 0.3): the maximum distance BNR will tolerate
between a token's p-value and that of its context; a token farther away is
filtered out. E.g., given a window with p-values [0.10 0.60 0.70] and a context
p-value of 0.5, the first word is filtered out.
•       use_bnr: whether or not to use BNR.
•       bayes_min_ham_num / bayes_min_spam_num: an existing Bayes setting. For
simplicity, BNR will use it as the point at which it starts learning, because
BNR cannot start until the Bayes token values have become relatively stable.
•       bnr_min_ham_num / bnr_min_spam_num: the point at which BNR starts to
take part in scanning.
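
The two radius parameters can be sketched as follows, in Python for
illustration only. The function names are hypothetical, and the context
p-value is taken as an input because the proposal does not say how it is
computed. Note that the example above illustrates bnr_token_radius on its own:
under the default bnr_window_radius, a context with p-value 0.5 would be
ignored before any token is examined.

```python
def context_is_interesting(context_p, window_radius=0.25):
    """A context matters only if its p-value lies outside 0.5 +/- window_radius."""
    return abs(context_p - 0.5) > window_radius


def tokens_outside_radius(window_probs, context_p, token_radius=0.3):
    """Indices of tokens whose p-value is farther than token_radius from the context."""
    return [i for i, p in enumerate(window_probs)
            if abs(p - context_p) > token_radius]


def filter_window(window_probs, context_p,
                  window_radius=0.25, token_radius=0.3):
    """Combine both checks: skip uninteresting contexts, else list tokens to drop."""
    if not context_is_interesting(context_p, window_radius):
        return []
    return tokens_outside_radius(window_probs, context_p, token_radius)


# The example from the parameter list: only the first word (index 0) is too far
# from the context p-value of 0.5.
dropped = tokens_outside_radius([0.10, 0.60, 0.70], 0.5)
```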

4.      Which modules will be affected?
•SQL: new context pattern table 
   CREATE TABLE context_pattern (
     id int(11) NOT NULL default '0',
     pattern char(11) NOT NULL default '',
     spam_count int(11) NOT NULL default '0',
     ham_count int(11) NOT NULL default '0',
     PRIMARY KEY  (id, pattern)
   );
•BayesStore, including:
    o   BayesStore.pm (interface)
    o   BayesStore::DBM.pm (DBM implementation)
    o   BayesStore::SQL.pm (general SQL implementation)
    o   BayesStore::MySQL.pm (MySQL)
    o   BayesStore::PgSQL.pm (PostgreSQL)
    Modified functions:
    o   clear_database (clear context pattern also)
    o   backup_database (backup context pattern also)
    o   restore_database (restore context pattern also)
    o   dump_db_toks (dump context pattern also)
    New functions:
    o   multi_context_count_change (save multiple context patterns)
    o   _put_context_pattern (save context pattern)
    o   context_get (retrieve context pattern)
    o   context_get_all (retrieve multiple context patterns)

•Bayes.pm, sub tokenize
    Add the necessary code right after the body part of a message has been
tokenized, covering scan, learn, and forget. It seems more appropriate to me to
add a new plugin, such as "after_bayes_tokenize", and put all the related code
in this new plugin. The core BNR algorithm will be integrated into this part.
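
As a rough picture of the new storage functions, here is a minimal sketch
using Python's sqlite3 module with an in-memory database. The real code would
live in the Perl BayesStore classes. The function names come from the list
above, but their arguments and return values are assumptions, as is the
reading of id as the per-user identifier. open_store is a hypothetical helper.

```python
import sqlite3


def open_store():
    """Create an in-memory database holding the proposed table."""
    db = sqlite3.connect(":memory:")
    db.execute("""
        CREATE TABLE context_pattern (
            id INTEGER NOT NULL DEFAULT 0,
            pattern TEXT NOT NULL DEFAULT '',
            spam_count INTEGER NOT NULL DEFAULT 0,
            ham_count INTEGER NOT NULL DEFAULT 0,
            PRIMARY KEY (id, pattern)
        )""")
    return db


def multi_context_count_change(db, user_id, patterns, spam_delta, ham_delta):
    """Apply the same count change to several patterns (assumed signature)."""
    for pattern in patterns:
        # Make sure the row exists, then adjust it; counts never go negative.
        db.execute(
            "INSERT OR IGNORE INTO context_pattern (id, pattern) VALUES (?, ?)",
            (user_id, pattern))
        db.execute(
            "UPDATE context_pattern"
            " SET spam_count = MAX(0, spam_count + ?),"
            "     ham_count = MAX(0, ham_count + ?)"
            " WHERE id = ? AND pattern = ?",
            (spam_delta, ham_delta, user_id, pattern))
    db.commit()


def context_get(db, user_id, pattern):
    """Return (spam_count, ham_count) for one pattern, or (0, 0) if unseen."""
    row = db.execute(
        "SELECT spam_count, ham_count FROM context_pattern"
        " WHERE id = ? AND pattern = ?",
        (user_id, pattern)).fetchone()
    return row if row else (0, 0)


def context_get_all(db, user_id, patterns):
    """Return a dict mapping each requested pattern to its counts."""
    return {p: context_get(db, user_id, p) for p in patterns}
```

Learning a spam message would call multi_context_count_change with deltas
(1, 0), and forgetting it would call the same function with (-1, 0), which
keeps the context patterns in step with the token counts.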

Any comments and opinions are appreciated.

Thank you
Jianyong Dai



