Re: Manually training SpamAssassin by forwarding mail
--On 02/04/05 09:17:55 -0400 Peter Marshall wrote:

> My question is the same as Henrik's. I have a bunch of email that is spam (either tagged by SpamAssassin or not tagged at all). I forwarded it as an attachment to a spam mailbox. What do I have to do now before I can get Bayes to learn the messages? I read that you have to remove the headers. Could anyone give me a little more detail?

There's no 100% reliable way to do this; it depends on how the message was mangled by the client (and possibly the server). The only guaranteed way is (as I described) to save a copy at the same point at which SpamAssassin inspects the message, so you can use it later.

That being said, forwarding a message as an attachment will usually preserve the headers pretty well. The Perl MailTools and MIME-tools modules have routines to pull out attachments and save them in the Unix mailbox format which sa-learn wants.

Sorry, I don't have any ready-made scripts for this; my users dump messages into shared IMAP mailboxes, which need no preprocessing before being fed to sa-learn.

-Kevin
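The extraction step Kevin describes (pull the attached message out and save it in mailbox format) could be sketched roughly as follows. This is a Python illustration using only the standard library rather than the Perl modules he names; the function names `extract_forwarded` and `append_to_mbox` are made up for this example, and it assumes the client really attached the spam as a `message/rfc822` part at the top level of the forwarded mail.

```python
import mailbox
from email import message_from_bytes


def extract_forwarded(raw: bytes) -> list:
    """Return the raw bytes of each top-level message/rfc822 attachment."""
    outer = message_from_bytes(raw)
    if not outer.is_multipart():
        return []
    found = []
    for part in outer.get_payload():
        # A well-formed message/rfc822 part holds exactly one nested message.
        if part.get_content_type() == "message/rfc822" and part.is_multipart():
            inner = part.get_payload(0)
            found.append(inner.as_bytes())
    return found


def append_to_mbox(messages, path):
    """Append raw messages to an mbox file, creating it if needed."""
    box = mailbox.mbox(path)
    try:
        for raw in messages:
            box.add(raw)
        box.flush()
    finally:
        box.close()
```

The resulting file could then be handed to something like `sa-learn --spam --mbox spam.mbox`. Whether the extracted copy still matches what SpamAssassin originally saw depends on the forwarding client, as discussed above.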
RE: Manually training SpamAssassin by forwarding mail
> --On 02/04/05 09:17:55 -0400 Peter Marshall wrote:
>
>> My question is the same as Henrik's. I have a bunch of email that is spam (either tagged by SpamAssassin or not tagged at all). I forwarded it as an attachment to a spam mailbox. What do I have to do now before I can get Bayes to learn the messages? I read that you have to remove the headers. Could anyone give me a little more detail?
>
> There's no 100% reliable way to do this; it depends on how the message was mangled by the client (and possibly the server). The only guaranteed way is (as I described) to save a copy at the same point at which SpamAssassin inspects the message, so you can use it later.
>
> That being said, forwarding a message as an attachment will usually preserve the headers pretty well. The Perl MailTools and MIME-tools modules have routines to pull out attachments and save them in the Unix mailbox format which sa-learn wants.
>
> Sorry, I don't have any ready-made scripts for this; my users dump messages into shared IMAP mailboxes, which need no preprocessing before being fed to sa-learn.
>
> -Kevin

Basically, I've got two options. All mail that is received is backed up on the mail server before any headers are added. I could match those copies with mail received in the spam-learn and ham-learn accounts. However, mail is backed up only for a limited time before being moved, after which the mail server no longer has access to it. So unless people report mail that found its way through the filters on a very regular basis, it won't be a foolproof solution.

The other option sounds more viable. I would only need to strip off the X-Scanned-By, X-Spam-* and X-Sanitized headers (which are ignored in my setup for Bayes anyhow), BUT I have no guarantee that the message is in its original format. Some MIME boundary rewriting may be done by the mail server (where necessary), as may conversion from 8bit to 7bit where possible. And I think that there are many client-side mail-filtering engines, spam scanners and virus scanners out there that may do some rewriting as well.

Given the above, I'm not sure that training SpamAssassin on forwarded messages, which may or may not be in the same format as when SpamAssassin first received them, is a good idea. But I don't know enough about SpamAssassin's internals and its Bayes filter to be sure...

Kind Regards,
Sander Holthaus
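The header stripping Sander mentions could look roughly like this. It is a Python sketch using the standard `email` package; the function name `strip_local_headers` is hypothetical, and the header list is simply the one from his message (X-Scanned-By, X-Spam-*, X-Sanitized), not an authoritative list of what should be removed.

```python
from email import message_from_bytes

# Headers named in the message above; adjust to whatever the local
# scanners actually add (this list is an assumption, not a standard).
EXACT_NAMES = {"x-scanned-by", "x-sanitized"}
NAME_PREFIXES = ("x-spam-",)


def strip_local_headers(raw: bytes) -> bytes:
    """Return the message with locally added scanner headers removed."""
    msg = message_from_bytes(raw)
    for name in {key.lower() for key in msg.keys()}:
        if name in EXACT_NAMES or name.startswith(NAME_PREFIXES):
            # Deleting by name removes every occurrence, case-insensitively.
            del msg[name]
    return msg.as_bytes()
```

This only removes headers. It cannot undo MIME boundary rewriting or 8bit-to-7bit conversion, which is exactly the concern raised above.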
RE: Manually training SpamAssassin by forwarding mail
--On 02/04/05 16:08:53 +0100 Sander Holthaus - Orange XL wrote:

> Basically, I've got two options. All mail that is received is backed up on the mail server before any headers are added. I could match those copies with mail received in the spam-learn and ham-learn accounts. However, mail is backed up only for a limited time before being moved, after which the mail server no longer has access to it. So unless people report mail that found its way through the filters on a very regular basis, it won't be a foolproof solution.

You don't really need a 100% solution; something which works 80% of the time would probably be fine. But you may not want to do the programming needed to automate this.

> The other option sounds more viable. I would only need to strip off the X-Scanned-By, X-Spam-* and X-Sanitized headers (which are ignored in my setup for Bayes anyhow), BUT I have no guarantee that the message is in its original format. Some MIME boundary rewriting may be done by the mail server (where necessary), as may conversion from 8bit to 7bit where possible. And I think that there are many client-side mail-filtering engines, spam scanners and virus scanners out there that may do some rewriting as well.

You'll probably find that the various changes don't affect Bayes that much. When a rewritten message is learned, you may make Bayes miss email which (in an ideal world) it would have caught, but I think it will tend to score such messages around 50% ("I don't know if this is ham or spam") rather than classifying them incorrectly. And there should be enough unchanged tokens in the messages to let Bayes work anyway.

So I say strip off what you can, but don't obsess about the rest. Feed it into Bayes and see how it works, and only try to fix it if you see Bayes misclassifying email.

-Kevin
RE: Manually training SpamAssassin by forwarding mail
> --On 02/04/05 16:08:53 +0100 Sander Holthaus - Orange XL wrote:
>
>> Basically, I've got two options. All mail that is received is backed up on the mail server before any headers are added. I could match those copies with mail received in the spam-learn and ham-learn accounts. However, mail is backed up only for a limited time before being moved, after which the mail server no longer has access to it. So unless people report mail that found its way through the filters on a very regular basis, it won't be a foolproof solution.
>
> You don't really need a 100% solution; something which works 80% of the time would probably be fine. But you may not want to do the programming needed to automate this.

I don't have the time for it yet, but I should be able to make something in Perl. Personally, I'm no big fan of the 80% rule in programming, as that last undone 20% usually forms 80% of my problems :-)

>> The other option sounds more viable. I would only need to strip off the X-Scanned-By, X-Spam-* and X-Sanitized headers (which are ignored in my setup for Bayes anyhow), BUT I have no guarantee that the message is in its original format. Some MIME boundary rewriting may be done by the mail server (where necessary), as may conversion from 8bit to 7bit where possible. And I think that there are many client-side mail-filtering engines, spam scanners and virus scanners out there that may do some rewriting as well.
>
> You'll probably find that the various changes don't affect Bayes that much. When a rewritten message is learned, you may make Bayes miss email which (in an ideal world) it would have caught, but I think it will tend to score such messages around 50% ("I don't know if this is ham or spam") rather than classifying them incorrectly. And there should be enough unchanged tokens in the messages to let Bayes work anyway.
>
> So I say strip off what you can, but don't obsess about the rest. Feed it into Bayes and see how it works, and only try to fix it if you see Bayes misclassifying email.

I'm not sure I know of a good way to check whether Bayes is misclassifying, but I should be able to get some of that information from the log files. Perhaps throwing away mail that has been rewritten or reformatted would be a solution, though I don't know whether such mail can be recognized easily. We'll see :-)

Thanks for all the help and suggestions!

Kind Regards,
Sander Holthaus
Re: Manually training SpamAssassin by forwarding mail
Peter Marshall wrote:
> Kevin Sullivan wrote:
>> --On 02/03/05 01:59:21 +0100 Sander Holthaus - Orange XL wrote:
>>
>>> I've been interested in letting customers manually train the SpamAssassin Bayes filter for ham and spam (to reduce false positives and negatives). However, I can only find documentation for this for local mailboxes and IMAP. Most users, however, retrieve their mail through POP and use Outlook (Express) as their mail client. Is there a way to train SpamAssassin with such a setup (e.g. forwarding mail with Outlook (Express) using SMTP)?
>>
>> If you want to do a lot of programming, you could save all incoming messages for a few days in a database somewhere. When a user forwards a message to a special ham or spam mailbox, you pull the Message-ID from the message and use it to recover the original message from your database.
>>
>> -Kevin
>
> My question is the same as Henrik's. I have a bunch of email that is spam (either tagged by SpamAssassin or not tagged at all). I forwarded it as an attachment to a spam mailbox. What do I have to do now before I can get Bayes to learn the messages? I read that you have to remove the headers. Could anyone give me a little more detail?

I use a modified version of DMZS-sa-learn.pl from:
http://www.dmzs.com/tools/files/spam.phtml

When someone forwards a spam to me, I move the message to a special IMAP folder that gets processed by the script. My additions look something like:

    use Email::MIME;
    ...
    my $msg = Email::MIME->new($raw_message_body);
    my @parts = $msg->parts;
    foreach (@parts) {
        # Each forwarded spam arrives as a message/rfc822 attachment.
        if ($_->content_type =~ m|message/rfc822|) {
            sa_learn($_->body_raw);
        }
    }

I've tested this with messages forwarded as attachments from Outlook and Thunderbird. I'm not sure how effective it is, though. I'm sure that it still loses something in the translation. All-IMAP is really the way to go if you can.

Stuart Johnston
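Kevin's quoted suggestion (keep incoming mail for a few days, then look up the original by Message-ID when a user forwards a copy) could be sketched as below. This is only an illustration in Python with SQLite. The table layout and the function names `open_store`, `remember`, `recall` and `expire` are invented for the example, and it assumes every message carries a usable Message-ID header.

```python
import sqlite3
import time
from email import message_from_bytes


def open_store(path=":memory:"):
    """Open (or create) the store of recently received messages."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS incoming ("
        "message_id TEXT PRIMARY KEY, received REAL NOT NULL, raw BLOB NOT NULL)"
    )
    return db


def message_id_of(raw):
    """Return the Message-ID header of a raw message, or None."""
    value = message_from_bytes(raw)["Message-ID"]
    return str(value).strip() if value is not None else None


def remember(db, raw, now=None):
    """Store an untouched incoming message, keyed by its Message-ID."""
    msg_id = message_id_of(raw)
    if msg_id is None:
        return None
    received = time.time() if now is None else now
    db.execute("INSERT OR REPLACE INTO incoming VALUES (?, ?, ?)",
               (msg_id, received, raw))
    db.commit()
    return msg_id


def recall(db, msg_id):
    """Return the stored original for a Message-ID, or None if unknown."""
    row = db.execute("SELECT raw FROM incoming WHERE message_id = ?",
                     (msg_id.strip(),)).fetchone()
    return row[0] if row else None


def expire(db, max_age_days, now=None):
    """Drop messages older than max_age_days."""
    current = time.time() if now is None else now
    db.execute("DELETE FROM incoming WHERE received < ?",
               (current - max_age_days * 86400,))
    db.commit()
```

In such a setup `remember` would run at delivery time, before any scanner headers are added, and `recall(db, message_id_of(forwarded_copy))` would fetch the pristine message to pass to sa-learn.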
RE: Manually training SpamAssassin by forwarding mail
-----Original Message-----
From: Stuart Johnston [mailto:[EMAIL PROTECTED]
Sent: Friday, February 04, 2005 5:20 PM
To: users@spamassassin.apache.org
Cc: Peter Marshall
Subject: Re: Manually training SpamAssassin by forwarding mail

> I use a modified version of DMZS-sa-learn.pl from:
> http://www.dmzs.com/tools/files/spam.phtml
>
> When someone forwards a spam to me, I move the message to a special IMAP folder that gets processed by the script. My additions look something like:
>
>     use Email::MIME;
>     ...
>     my $msg = Email::MIME->new($raw_message_body);
>     my @parts = $msg->parts;
>     foreach (@parts) {
>         if ($_->content_type =~ m|message/rfc822|) {
>             sa_learn($_->body_raw);
>         }
>     }
>
> I've tested this with messages forwarded as attachments from Outlook and Thunderbird. I'm not sure how effective it is, though. I'm sure that it still loses something in the translation. All-IMAP is really the way to go if you can.
>
> Stuart Johnston

Would it be an idea to strip the Delivered-To header from the message, since it carries no information for distinguishing ham from spam?

Also, I was wondering whether anybody who is using spam-learn and ham-learn addresses has any protection built in to stop non-system users from mailing those addresses?

Kind Regards,
Sander Holthaus
Re: Manually training SpamAssassin by forwarding mail
Peter Marshall wrote:
> Stuart Johnston wrote:
>> I've tested this with messages forwarded as attachments from Outlook and Thunderbird. I'm not sure how effective it is, though. I'm sure that it still loses something in the translation. All-IMAP is really the way to go if you can.
>
> But I have no IMAP, only POP. Users would forward spam (as an attachment) to a mailbox, and then I would have to run sa-learn, I assume as root? Will the code you posted work for this setup as well? Would there be big problems with just running it on the messages forwarded as attachments?

The code I posted only shows how you can extract the attached spam from the email. You'll need to write your own code to integrate it into your particular setup. BTW, in Outlook you can easily attach multiple spams to one message, and this code should handle that.

> Can users also forward, as attachments, mail that was already marked as spam? Is there any advantage to this?

If you use Bayes auto-learn, I suspect that this wouldn't do much. Otherwise, it might help.

Stuart Johnston
RE: Manually training SpamAssassin by forwarding mail
-----Original Message-----
From: Stuart Johnston [mailto:[EMAIL PROTECTED]
Sent: Friday, February 04, 2005 7:35 PM
To: Peter Marshall; SpamAssassin Users
Subject: Re: Manually training SpamAssassin by forwarding mail

>> But I have no IMAP, only POP. Users would forward spam (as an attachment) to a mailbox, and then I would have to run sa-learn, I assume as root? Will the code you posted work for this setup as well? Would there be big problems with just running it on the messages forwarded as attachments?
>
> The code I posted only shows how you can extract the attached spam from the email. You'll need to write your own code to integrate it into your particular setup. BTW, in Outlook you can easily attach multiple spams to one message, and this code should handle that.

CTRL-A, right click, Forward Items will indeed do the trick.

>> Can users also forward, as attachments, mail that was already marked as spam? Is there any advantage to this?
>
> If you use Bayes auto-learn, I suspect that this wouldn't do much. Otherwise, it might help.

I would check the headers of the forwarded messages to see whether their spam score is above your auto-learning threshold. If it is, relearning is probably quite useless. You might also wonder why users received the message at all (I would think that something that scores high enough to be auto-learned scores high enough to be refused or discarded).

Kind Regards,
Sander Holthaus
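Sander's check (compare the score recorded in a forwarded message with the auto-learn threshold) might be sketched as follows. This Python example assumes the message still carries an `X-Spam-Status` header of the usual `score=` (or older `hits=`) form. The function names are hypothetical, and the default of 12.0 is only an assumed value standing in for the local `bayes_auto_learn_threshold_spam` setting. The header score is also just a proxy, since auto-learning applies its own additional conditions.

```python
import re
from email import message_from_bytes

# Matches "score=15.2" as well as the older "hits=15.2" spelling.
_SCORE = re.compile(r"\b(?:score|hits)=(-?\d+(?:\.\d+)?)")


def spam_score(raw):
    """Return the score recorded in X-Spam-Status, or None if absent."""
    status = message_from_bytes(raw)["X-Spam-Status"]
    if status is None:
        return None
    match = _SCORE.search(str(status))
    return float(match.group(1)) if match else None


def worth_relearning(raw, autolearn_spam_threshold=12.0):
    """True unless the recorded score already reaches the threshold.

    The threshold default is an assumption; use the locally configured value.
    """
    score = spam_score(raw)
    return score is None or score < autolearn_spam_threshold
```

Messages without the header, or scoring below the threshold, would be passed on for manual learning; the rest could be skipped.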
RE: Manually training SpamAssassin by forwarding mail
First, I had understood that Bayes can learn previously tagged emails without the SpamAssassin tags being stripped first. Has this changed?

Second, all of my users use a webmail client, though they can use OE if they wish. It is probably best for them to use IMAP so that server-side scanning can be set up more easily.

I currently have two scripts that run nightly. The first takes everything in the user's /home/user/mail/Spam folder, learns it as spam, and then empties the folder. The second does the same for Ham, but moves that mail to a Cleaned folder. All the user has to do is move untagged spam into Spam and false positives into Ham.

-- JAV

-- Original Message ---
From: Sander Holthaus - Orange XL [EMAIL PROTECTED]
To: 'SpamAssassin Users' users@spamassassin.apache.org
Cc: 'Stuart Johnston' [EMAIL PROTECTED], 'Peter Marshall' [EMAIL PROTECTED]
Sent: Fri, 4 Feb 2005 19:47:40 +0100
Subject: RE: Manually training SpamAssassin by forwarding mail

> I would check the headers of the forwarded messages to see whether their spam score is above your auto-learning threshold. If it is, relearning is probably quite useless. You might also wonder why users received the message at all (I would think that something that scores high enough to be auto-learned scores high enough to be refused or discarded).
>
> Kind Regards,
> Sander Holthaus

--- End of Original Message ---
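The two nightly jobs JAV describes could be sketched roughly like this in Python. The sketch assumes the Spam, Ham and Cleaned folders are single mbox files under the user's mail directory, and it ignores mailbox locking, which a real job would need. The function names are invented for the example; `sa-learn --spam`, `--ham` and `--mbox` are the only SpamAssassin options used.

```python
import subprocess
from pathlib import Path


def sa_learn_command(kind, mbox_path):
    """Build the sa-learn command line for one mbox file."""
    if kind not in ("spam", "ham"):
        raise ValueError("kind must be 'spam' or 'ham'")
    return ["sa-learn", "--" + kind, "--mbox", str(mbox_path)]


def train_folder(kind, mbox_path, run=subprocess.run):
    """Train on a folder; return False if it is missing or empty."""
    path = Path(mbox_path)
    if not path.exists() or path.stat().st_size == 0:
        return False
    run(sa_learn_command(kind, path), check=True)
    return True


def nightly(mail_dir, run=subprocess.run):
    """Learn Spam then empty it; learn Ham then move it to Cleaned."""
    mail_dir = Path(mail_dir)
    spam, ham, cleaned = mail_dir / "Spam", mail_dir / "Ham", mail_dir / "Cleaned"
    if train_folder("spam", spam, run):
        spam.write_bytes(b"")  # empty the Spam folder
    if train_folder("ham", ham, run):
        with cleaned.open("ab") as out:
            out.write(ham.read_bytes())  # keep the ham in Cleaned
        ham.write_bytes(b"")
```

Called from cron as, say, `nightly("/home/user/mail")`, it trains only when a folder has content, and the `run` parameter exists so the logic can be exercised without invoking sa-learn.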
RE: Manually training SpamAssassin by forwarding mail
At 07:59 PM 2/2/2005, Sander Holthaus - Orange XL wrote:
> I've been interested in letting customers manually train the SpamAssassin Bayes filter for ham and spam (to reduce false positives and negatives). However, I can only find documentation for this for local mailboxes and IMAP. Most users, however, retrieve their mail through POP and use Outlook (Express) as their mail client. Is there a way to train SpamAssassin with such a setup (e.g. forwarding mail with Outlook (Express) using SMTP)?

Matt Kettler wrote:
> Only if you can somehow get the users to forward an unmangled message, complete with original headers, as an attachment. You can then have a script strip off the attachments and feed those to sa-learn.
>
> The fundamental problem with normal forwarding is that, from SA's perspective, the forwarded message looks very little like the original: new headers, different encoding, extra text often added to the body, superfluous MIME sections dropped, others added. Since SA learns from the message headers, and some of the message encoding has an impact on learning, these changes cause problems.

Will Yardley wrote:
> There are various schemes to do this; the tricky part is getting people to submit emails in a consistent format. If you can get them to forward them as message/rfc822 attachments, it probably wouldn't be too hard to write a program to extract them and train... I imagine this would be too complicated for many users, though.
>
> One scheme that we've used is to have specially named IMAP folders that users can place misclassified emails in for training. Then you can have a server-side robot which trains the filter and then discards the emails.

Thanks, I figured that would be the problem. It makes it pretty hard, if not impossible, to create such a system for average users. I was hoping that SpamAssassin would include a system similar to DSPAM's.

On the side: if I got such a system working (where users are able to forward emails untouched and I am able to extract those messages for sa-learn), could I expect problems with some locally added headers? For instance, headers added when the message passes through a local anti-spam or anti-virus proxy. Or, in the case of IMAP, when users flag messages (or if they are automatically flagged) before moving them to a learn-ham / learn-spam folder?

Kind Regards,
Sander Holthaus