> --On 02/04/05 16:08:53 +0100 Sander Holthaus - Orange XL wrote:
> > Basically, I've got two options. All mail that is received is backed
> > up on the mailserver before any headers are added. I could match
> > those copies with mail received in the spam-learn and ham-learn
> > accounts. However, mail is backed up only for a limited amount of
> > time before being moved, after which the mailserver no longer has
> > access to it. So unless people report mail that found its way
> > through the filters on a very regular basis, it won't be a foolproof
> > solution.
> 
> You don't really need a 100% solution; something which works 
> 80% of the time would probably be fine.  But you may not want 
> to do the programming needed to automate this.

I don't have the time for it yet, but I should be able to make something in
Perl. Personally, I'm no big fan of the 80% rule in programming, as that last
undone 20% usually forms 80% of my problems :-)
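
For illustration, the first option (matching a reported message against the
pre-filter backup) could be sketched roughly as below. This is a minimal
sketch in Python rather than Perl, and it assumes the Message-ID header
survives unchanged between the backup copy and the reported copy, which may
not hold in every setup. The function names are hypothetical.

```python
import email
from email import policy


def message_id(raw_bytes):
    """Return the Message-ID of a raw message, or None if it has none."""
    msg = email.message_from_bytes(raw_bytes, policy=policy.compat32)
    mid = msg.get("Message-ID")
    return str(mid).strip() if mid else None


def index_backups(backups):
    """Build a Message-ID -> raw message lookup from the backed-up mail.

    `backups` is any iterable of raw messages as bytes (assumption: the
    caller reads them from wherever the mailserver keeps its backups).
    """
    index = {}
    for raw in backups:
        mid = message_id(raw)
        if mid:
            # Keep the first copy seen for a given Message-ID.
            index.setdefault(mid, raw)
    return index


def find_original(reported_raw, index):
    """Return the untouched backup copy for a reported message, or None."""
    mid = message_id(reported_raw)
    return index.get(mid) if mid else None
```

A reported message with no match (for example because the backup has already
been moved away) simply returns None, so the caller can fall back to the
header-stripping approach for those.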
 
> > The other option sounds more viable. I would only need to strip off
> > the X-Scanned-By, X-Spam-* and X-Sanitized headers (which are ignored
> > in my setup for bayes anyhow), BUT I have no guarantee that the
> > message is in its original format. Some MIME boundary rewriting may
> > be done by the mailserver (where necessary), as is converting 8bit
> > to 7bit where possible. And I think that there are many client-side
> > mail-filtering engines, spam scanners and virus scanners out there
> > that may do some rewriting as well.
> 
> You'll probably find that the various changes don't affect 
> bayes that much. 
> When a re-written message is learned you may make bayes miss 
> email which (in an ideal world) it would have caught, but I 
> think it will tend to classify messages around 50% "I don't 
> know if this is ham or spam" rather than classifying it 
> incorrectly.  And there should be enough unchanged tokens in 
> the messages to let bayes work anyways.
> 
> So I say strip off what you can but don't obsess about the 
> rest.  Feed it into bayes and see how it works, and only try 
> to fix it if you see bayes misclassifying email.
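
The "strip off what you can" step could look something like the following.
It is only a sketch in Python (the same idea ports to Perl) covering the
three header groups named above: X-Scanned-By, X-Spam-* and X-Sanitized.
The function name is hypothetical. Note that the re-serialised message uses
LF line endings, so it is not guaranteed to be byte-identical to the input
even apart from the removed headers.

```python
import email
from email import policy
from email.generator import BytesGenerator
from io import BytesIO

# Header names compared in lower case; anything starting with a prefix
# in PREFIXES is removed as well.
EXACT = {"x-scanned-by", "x-sanitized"}
PREFIXES = ("x-spam-",)


def strip_filter_headers(raw_bytes):
    """Return the message with the scanner-added headers removed."""
    msg = email.message_from_bytes(raw_bytes, policy=policy.compat32)
    for name in {key.lower() for key in msg.keys()}:
        if name in EXACT or name.startswith(PREFIXES):
            # Deletion is case-insensitive and removes every occurrence.
            del msg[name]
    out = BytesIO()
    # mangle_from_=False avoids turning body lines starting with
    # "From " into ">From ".
    BytesGenerator(out, mangle_from_=False).flatten(msg)
    return out.getvalue()
```

The result can then be handed to whatever feeds the spam-learn and ham-learn
mail into bayes.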

I'm not sure I know of a good way to check whether BAYES is
misclassifying, but I should be able to get some of that information from the
logfiles. Perhaps throwing away mail that has been rewritten/reformatted
would be a solution, though I don't know if those can be recognized easily.
We'll see :-)
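
One possible way to recognize rewritten mail, sketched here in Python under
the assumption that the original backup copy is still available for
comparison: hash everything after the header block in both copies (with line
endings normalized) and treat a mismatch as "rewritten". The function names
are hypothetical, and this only detects body changes such as MIME boundary
rewriting or 8bit-to-7bit conversion, not header changes.

```python
import hashlib


def body_digest(raw_bytes):
    """SHA-256 of the message body, ignoring CRLF/LF differences."""
    normalized = raw_bytes.replace(b"\r\n", b"\n")
    # The body starts after the first blank line.
    _headers, _sep, body = normalized.partition(b"\n\n")
    return hashlib.sha256(body.rstrip()).hexdigest()


def was_rewritten(original_raw, reported_raw):
    """True if the reported copy's body differs from the backup copy's."""
    return body_digest(original_raw) != body_digest(reported_raw)
```

Messages for which this returns True could be skipped rather than learned,
at the cost of learning from fewer reports.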

Thanks for all the help and suggestions!

Kind Regards,
Sander Holthaus
