Basically, I've got two options. All received mail is backed up on the mail server before any headers are added. I could match those backups against the mail received in the spam-learn and ham-learn accounts. However, mail is only backed up for a limited amount of time before being moved, after which the mail server no longer has access to it. So unless people report mail that found its way through the filters on a very regular basis, it won't be a foolproof solution.
You don't really need a 100% solution; something which works 80% of the time would probably be fine. But you may not want to do the programming needed to automate this.
The other option sounds more viable. I would only need to strip off the X-Scanned-By, X-Spam-* and X-Sanitized headers (which are ignored in my setup for bayes anyhow), BUT I have no guarantee that the message is in its original format. The mail server may rewrite MIME boundaries (where necessary) and convert 8bit to 7bit where possible. And I think there are many client-side mail filtering engines, spam scanners and virus scanners out there that may do some rewriting as well.
You'll probably find that the various changes don't affect bayes that much. When a rewritten message is learned, bayes may miss email which (in an ideal world) it would have caught, but I think it will tend to score such messages around 50% ("I don't know if this is ham or spam") rather than classifying them incorrectly. And there should be enough unchanged tokens in the messages to let bayes work anyway.
So I say strip off what you can but don't obsess about the rest. Feed it into bayes and see how it works, and only try to fix it if you see bayes misclassifying email.
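For the stripping step, something along these lines would probably do. This is only a rough sketch in Python, not a tested part of anyone's setup: the function name strip_scanner_headers is made up for illustration, and it assumes the reported message is available as raw bytes. It drops the X-Scanned-By, X-Spam-* and X-Sanitized headers (including folded continuation lines) and leaves everything else, body included, byte for byte as it arrived.

```python
import re

# Headers added by the scanners, as named in the post. Matching is
# case-insensitive because header names are not case-sensitive.
STRIP = re.compile(rb"^(x-scanned-by|x-sanitized|x-spam-[^:\s]*)\s*:", re.IGNORECASE)


def strip_scanner_headers(raw: bytes) -> bytes:
    """Return the message with scanner headers removed (hypothetical helper).

    Works on raw bytes and never re-serializes the message, so the body
    and all remaining headers are passed through unchanged.
    """
    out = []
    in_headers = True
    dropping = False  # True while inside a header that is being removed
    for line in raw.splitlines(keepends=True):
        if in_headers:
            if line in (b"\n", b"\r\n"):
                # First blank line ends the header block.
                in_headers = False
                dropping = False
            elif line[:1] in (b" ", b"\t"):
                # Folded continuation line: shares the fate of its header.
                if dropping:
                    continue
            else:
                dropping = bool(STRIP.match(line))
                if dropping:
                    continue
        out.append(line)
    return b"".join(out)
```

The cleaned message can then be handed to whatever does the learning (with SpamAssassin that would presumably be sa-learn). Lines in the body that merely look like these headers are left alone, since stripping stops at the first blank line.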
-Kevin