http://issues.apache.org/SpamAssassin/show_bug.cgi?id=3078
------- Additional Comments From [EMAIL PROTECTED] 2007-06-14 23:23 -------

Hi, all,

I will implement this Bayesian Noise Reduction (BNR) module as my Google Summer of Code project. If you do not yet know what BNR is, please refer to Jonathan's original paper (http://bnr.nuclearelephant.com/). Here I present my design and ask for your opinions.

There are a couple of design choices to make. My choices are based on the fact that BNR is needed only by the Bayes learner, and my primary goals are to keep the configuration simple and the performance penalty low. Here is my general design for your review:

1. Where to hook the BNR filter?

BNR exists to improve Bayes performance, so I will confine BNR to the Bayes module. Other modules will not see the result of BNR filtering.

Before a message is passed to the Bayes scanner, it is parsed into several parts. The visible body (all text of text/plain and the visible part of text/html, with HTML tags stripped), the invisible body (the hidden or hard-to-see part of text/html, with HTML tags stripped), the URL list and all headers are presented to the Bayes scanner.

The visible body is a list of the words visible to the user, in their original order. The BNR consistency check, which checks the consistency of each word with its neighboring words, will help to find out-of-context words. Whether BNR helps with the invisible body is doubtful, since the invisible body usually consists of short sentences or phrases scattered around the email. The other two parts, URLs and headers, are not suitable for BNR filtering.

I will add the BNR filtering code after the visible body (and the invisible body?) has been tokenized. The Bayes learner will then see only the purified tokens. Note also that in Jonathan's original work, HTML tags are not removed before BNR. SA is different: Bayes in SA does not deal with HTML tags.

Context pattern learning will be added to the Bayes learning process. There will not be a separate context pattern learning process.
Every time Bayes updates token counts (sa-learn or auto-learn), it will also update the context patterns. Forgetting context patterns is tied to forgetting tokens in the same way. Context patterns will not expire, since the database will hold at most 20*20*20=8000 rows. Every time we purge, back up, restore or dump the Bayes database, we do the same to the context patterns.

2. How to perform the noise reduction?

Thanks to Jonathan's open source release of libbnr, the core BNR algorithm is about 80 lines of C. It should be easy to adapt to SA. The tricky part is how to pass the context patterns and token statistics to BNR.

3. What are the new configuration items?

I will try to keep the number of new parameters small. Here are the parameters:

bnr_window_radius (default: 0.25)
    The window around 0.5 that BNR considers uninteresting. The default of 0.25 means BNR ignores contexts with a p-value between 0.25 and 0.75.

bnr_token_radius (default: 0.3)
    The maximum distance between a token's p-value and its context's p-value that BNR will tolerate; a token further away is filtered out. E.g., given a window with p-values [0.10 0.60 0.70] and a context p-value of 0.5, the first word is filtered out.

use_bnr
    Whether or not to use BNR.

bayes_min_ham_num / bayes_min_spam_num
    Existing Bayes settings. For simplicity, BNR will use them as the point at which it starts learning. This is because BNR cannot start until the Bayes token values have become relatively stable.

bnr_min_ham_num / bnr_min_spam_num
    The point at which BNR gets involved in scanning.

4. Which modules will be affected?
SQL: new context pattern table

    CREATE TABLE context_pattern (
      id int(11) NOT NULL default '0',
      pattern char(11) NOT NULL default '',
      spam_count int(11) NOT NULL default '0',
      ham_count int(11) NOT NULL default '0',
      PRIMARY KEY (id, pattern)
    );

BayesStore, including:
  o BayesStore.pm (interface)
  o BayesStore::DBM.pm (DBM implementation)
  o BayesStore::SQL.pm (general SQL implementation)
  o BayesStore::MySQL.pm (MySQL)
  o BayesStore::PgSQL.pm (PostgreSQL)

Modified functions:
  o clear_database (clear context patterns also)
  o backup_database (back up context patterns also)
  o restore_database (restore context patterns also)
  o dump_db_toks (dump context patterns also)

New functions:
  o multi_context_count_change (save multiple context patterns)
  o _put_context_pattern (save a context pattern)
  o context_get (retrieve a context pattern)
  o context_get_all (retrieve multiple context patterns)

Bayes.pm, sub tokenize

Add the new code right after the body of a message has been tokenized, covering scan, learn and forget. It seems more appropriate to me to add a new plugin, such as "after_bayes_tokenize", and put all the related code in this new plugin. The core BNR algorithm will be integrated into this part.

Any comments and opinions are appreciated.

Thank you
Jianyong Dai
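
To make the two radius parameters from item 3 and the spam_count / ham_count columns of the context pattern table more concrete, here is a minimal sketch in Python (not SA's Perl, and not taken from libbnr). It assumes a window of three tokens and p-values bucketed in steps of 0.05, which is one way to arrive at roughly 20*20*20 patterns. It also assumes the two radii combine the way their descriptions suggest: a context within bnr_window_radius of 0.5 is skipped, otherwise any token further than bnr_token_radius from the context's p-value is dropped. The names pattern_key, bnr_filter and learn_patterns, and the key format, are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterator, List, Sequence, Tuple

WINDOW_SIZE = 3   # assumption: three tokens per context
BAND_STEP = 0.05  # assumption: p-values are bucketed in steps of 0.05


def pattern_key(window: Sequence[float]) -> str:
    """Build a hypothetical pattern key from the banded p-values of a window."""
    return "-".join(f"{round(p / BAND_STEP):02d}" for p in window)


def windows(token_probs: Sequence[float]) -> Iterator[Tuple[int, Sequence[float]]]:
    """Yield (start index, window) for every full sliding window."""
    for start in range(len(token_probs) - WINDOW_SIZE + 1):
        yield start, token_probs[start:start + WINDOW_SIZE]


def bnr_filter(
    token_probs: Sequence[float],
    pattern_prob: Callable[[str], float],
    window_radius: float = 0.25,  # bnr_window_radius
    token_radius: float = 0.3,    # bnr_token_radius
) -> List[bool]:
    """Return one keep/drop flag per token (False means filtered out)."""
    keep = [True] * len(token_probs)
    for start, window in windows(token_probs):
        ctx = pattern_prob(pattern_key(window))
        # Contexts close to 0.5 are treated as uninteresting and skipped.
        if abs(ctx - 0.5) <= window_radius:
            continue
        # Drop tokens that disagree too strongly with their context.
        for offset, p in enumerate(window):
            if abs(p - ctx) > token_radius:
                keep[start + offset] = False
    return keep


def learn_patterns(
    token_probs: Sequence[float],
    counts: Dict[str, List[int]],
    is_spam: bool,
) -> None:
    """Bump [spam_count, ham_count] for every pattern seen in a message."""
    column = 0 if is_spam else 1
    for _, window in windows(token_probs):
        counts[pattern_key(window)][column] += 1


if __name__ == "__main__":
    probs = [0.10, 0.95, 0.90]
    print(pattern_key(probs))                  # 02-19-18
    print(bnr_filter(probs, lambda key: 0.9))  # [False, True, True]
    counts: Dict[str, List[int]] = defaultdict(lambda: [0, 0])
    learn_patterns(probs, counts, is_spam=True)
    print(dict(counts))                        # {'02-19-18': [1, 0]}
```

In this sketch the pattern lookup is passed in as a function, so the same filter could be backed by either the DBM or the SQL store; how a pattern's p-value is derived from its spam and ham counts is left open here.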
