> --On 02/04/05 16:08:53 +0100 Sander Holthaus - Orange XL wrote:
>
> > Basically, I've got two options. All mail that is received is backed up
> > on the mailserver before any headers are added. I could match those
> > with mail received in the spam-learn and ham-learn accounts. However,
> > mail is backed up only for a limited amount of time before being moved,
> > after which the mailserver no longer has access to it. So unless
> > people report mail that found its way through the filters on a very
> > regular basis, it won't be a foolproof solution.
>
> You don't really need a 100% solution; something which works 80% of the
> time would probably be fine. But you may not want to do the programming
> needed to automate this.

I don't have the time for it yet, but I should be able to make something in Perl. Personally, I'm no big fan of the 80% rule in programming, as that last undone 20% usually forms 80% of my problems :-)

> > The other option sounds more viable. I would only need to strip off
> > the X-Scanned-By, X-Spam-* and X-Sanitized headers (which are ignored
> > in my setup for bayes anyhow), BUT I have no guarantee that the
> > message is in its original format. Some MIME boundary rewriting may be
> > done by the mailserver (where necessary), as is converting 8bit to
> > 7bit where possible. And I think that there are many client-side
> > mail-filtering engines, spam scanners and virus scanners out there that
> > may do some rewriting as well.
>
> You'll probably find that the various changes don't affect bayes that
> much. When a rewritten message is learned you may make bayes miss
> email which (in an ideal world) it would have caught, but I think it
> will tend to classify messages around 50% "I don't know if this is ham
> or spam" rather than classifying it incorrectly. And there should be
> enough unchanged tokens in the messages to let bayes work anyway.
>
> So I say strip off what you can but don't obsess about the rest. Feed
> it into bayes and see how it works, and only try to fix it if you see
> bayes misclassifying email.

I'm not sure I know of a good way to check whether BAYES is misclassifying, but I should be able to get some of that information from the logfiles. Perhaps throwing away mail that has been rewritten/reformatted would be a solution, though I don't know if those can be recognized easily. We'll see :-)

Thanks for all the help and suggestions!

Kind Regards,
Sander Holthaus
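
P.S. For reference, the header stripping described above could look roughly like the sketch below. It is in Python rather than Perl purely for illustration. The function name strip_filter_headers is made up for this example, and the header names are simply the ones mentioned in this thread (X-Scanned-By, X-Spam-*, X-Sanitized). It only touches the header block and leaves the body bytes as they are, so it does nothing about MIME boundary or 8bit/7bit rewriting.

```python
# Header names taken from the thread; adjust to whatever your setup adds.
STRIP_EXACT = {b"x-scanned-by", b"x-sanitized"}
STRIP_PREFIXES = (b"x-spam-",)


def strip_filter_headers(raw: bytes) -> bytes:
    """Return the message with scanner-added headers removed.

    Hypothetical helper: works on raw bytes so the body is passed
    through untouched. Folded (continuation) lines of a removed
    header are removed along with it.
    """
    lines = raw.splitlines(keepends=True)
    kept = []
    dropping = False
    for i, line in enumerate(lines):
        # The first empty line ends the header block; copy the rest as-is.
        if line in (b"\n", b"\r\n"):
            kept.extend(lines[i:])
            break
        # Continuation lines belong to the previous header.
        if line[:1] in (b" ", b"\t"):
            if not dropping:
                kept.append(line)
            continue
        name = line.split(b":", 1)[0].strip().lower()
        dropping = name in STRIP_EXACT or name.startswith(STRIP_PREFIXES)
        if not dropping:
            kept.append(line)
    return b"".join(kept)


if __name__ == "__main__":
    sample = (
        b"From: someone@example.com\n"
        b"X-Spam-Status: Yes,\n"
        b"\tscore=9.1\n"
        b"Subject: test\n"
        b"\n"
        b"Body text\n"
    )
    print(strip_filter_headers(sample).decode("ascii"))
```

The cleaned output could then be handed to the learner of your choice; how that is invoked depends on the local setup.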