--On 02/04/05 16:08:53 +0100 Sander Holthaus - Orange XL wrote:
Basically, I've got two options. All mail that is received is backed up on
the mailserver before any headers are added. I could match those with mail
received in the spam-learn and ham-learn accounts. However, mail is
backed up only for a limited amount of time before being moved, after
which the mail-server no longer has access to it. So unless people
report mail that found its way through the filters on a very regular
basis, it won't be a foolproof solution.

You don't really need a 100% solution; something which works 80% of the time would probably be fine. But you may not want to do the programming needed to automate this.
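
If you do want to try automating the matching, here is a rough Python sketch of one way it might go. It assumes the reported copy still carries the same Message-ID header as the backed-up original, which is my assumption and not something your setup guarantees. The function names (message_id, find_originals) are made up for illustration, and how you read the backup store and the learn accounts is left out.

```python
import email
from email import policy


def message_id(raw: bytes):
    """Return the Message-ID of a raw message, or None if it has none."""
    msg = email.message_from_bytes(raw, policy=policy.compat32)
    mid = msg.get("Message-ID")
    # str() guards against Header objects returned for odd header bytes.
    return str(mid).strip() if mid is not None else None


def find_originals(backups, reported):
    """Pair each reported message with its backed-up original, if any.

    backups and reported are iterables of raw messages (bytes).
    Returns a list of (reported_raw, original_raw_or_None) tuples.
    """
    index = {}
    for raw in backups:
        mid = message_id(raw)
        if mid is not None:
            # Keep the first copy seen for a given Message-ID.
            index.setdefault(mid, raw)
    return [(raw, index.get(message_id(raw))) for raw in reported]
```

Anything that comes back paired with None has already aged out of the backup (or lost its Message-ID), so it would have to be learned from the reported copy instead.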

The other option sounds more viable: I would only need to strip off the
X-Scanned-By, X-Spam-* and X-Sanitized headers (which are ignored in my
setup for bayes anyhow), BUT I have no guarantee that the message is in
its original format. Some MIME boundary rewriting may be done by the
mailserver (where necessary), as may conversion from 8bit to 7bit where
possible. And I think that there are many client-side mail filtering
engines, spam scanners and virus scanners out there that may do some
rewriting as well.

You'll probably find that the various changes don't affect bayes that much. When a re-written message is learned you may make bayes miss email which (in an ideal world) it would have caught, but I think it will tend to score such messages around 50% ("I don't know if this is ham or spam") rather than classify them incorrectly. And there should be enough unchanged tokens in the messages to let bayes work anyway.

So I say strip off what you can but don't obsess about the rest. Feed it into bayes and see how it works, and only try to fix it if you see bayes misclassifying email.
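
For the stripping itself, something along these lines might be enough. It is only a sketch using Python's standard email package; the header names are the ones you listed, the function name strip_filter_headers is made up, and the message may be re-serialized slightly differently (line endings, for instance), which per the above shouldn't matter much to bayes.

```python
import email
from email import policy

# Header names taken from the list above; adjust to your own setup.
STRIP_EXACT = {"x-scanned-by", "x-sanitized"}
STRIP_PREFIXES = ("x-spam-",)


def strip_filter_headers(raw: bytes) -> bytes:
    """Return the message with scanner-added headers removed."""
    # compat32 keeps the remaining headers as close to the input as possible.
    msg = email.message_from_bytes(raw, policy=policy.compat32)
    for name in set(msg.keys()):
        lower = name.lower()
        if lower in STRIP_EXACT or lower.startswith(STRIP_PREFIXES):
            # Deleting by name removes every occurrence of that header.
            del msg[name]
    return msg.as_bytes()
```

The result can then be handed to whatever you use to train bayes from the spam-learn and ham-learn accounts.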

        -Kevin



