Re: Manually training SpamAssassin by forwarding mail

2005-02-04 Thread Kevin Sullivan
--On 02/04/05 09:17:55 -0400 Peter Marshall wrote:
> My question is the same as Henrik's. I have a bunch of email that is spam
> (either tagged by SpamAssassin or not tagged at all). I forwarded it as
> an attachment to a spam mailbox. What do I have to do now before I
> can get Bayes to learn the messages? I read that you have to remove the
> headers. Could anyone give me a little more detail?

There's no 100% good way to do this; it depends on how the message was 
mangled by the client (and possibly server).  The only guaranteed way is 
(as I described) to save a copy at the same point as it is inspected by 
SpamAssassin so you can use it later.

That being said, forwarding a message as an attachment will usually 
preserve the headers pretty well.  The perl MailTools and MIME-tools 
modules have procedures to pull out attachments and save them in the Unix 
format which sa-learn wants.
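
For readers who would rather not use Perl, the same extraction can be sketched with Python's standard library. This is only an illustration under assumptions: the function names `attached_messages` and `save_as_mbox` are made up for this example, only top-level message/rfc822 attachments are considered, and mailbox locking is left out.

```python
import mailbox
from email import message_from_bytes, policy


def attached_messages(raw_bytes):
    """Return the message/rfc822 attachments found at the top level of a mail."""
    outer = message_from_bytes(raw_bytes, policy=policy.default)
    found = []
    for part in outer.iter_parts():
        if part.get_content_type() == "message/rfc822":
            # A message/rfc822 part carries the embedded mail as its only payload item.
            found.append(part.get_payload(0))
    return found


def save_as_mbox(raw_bytes, mbox_path):
    """Append every attached message to an mbox file and return how many were saved."""
    messages = attached_messages(raw_bytes)
    box = mailbox.mbox(mbox_path)  # created if it does not exist yet
    try:
        # Assumption: nothing else writes to this file at the same time.
        # A real script should call box.lock() / box.unlock() around the update.
        for message in messages:
            box.add(message)
        box.flush()
    finally:
        box.close()
    return len(messages)
```

The resulting file could then be handed to something like `sa-learn --spam --mbox spam.mbox`. Note that the embedded message is re-serialized on the way into the mbox, so it is not guaranteed to be byte-for-byte what the client attached.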

Sorry I don't have any ready-made scripts for this; my users dump messages 
into shared IMAP mailboxes which don't need any preprocessing before being 
fed to sa-learn.

	-Kevin




RE: Manually training SpamAssassin by forwarding mail

2005-02-04 Thread Sander Holthaus - Orange XL

Basically, I've got two options. All received mail is backed up on
the mail server before any headers are added, so I could match those copies
against the mail received in the spam-learn and ham-learn accounts. However,
mail is backed up only for a limited time before being moved, after which
the mail server no longer has access to it. So unless people report mail
that found its way through the filters on a very regular basis, it won't be
a foolproof solution.
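
A rough sketch of what that matching could look like, using the Message-ID as the lookup key. Everything here is an assumption for illustration: the SQLite table layout, the one-week retention period and the function names are invented, and the reported message is assumed to still carry the original Message-ID (which normally means it was forwarded as an attachment and extracted first).

```python
import sqlite3
import time
from email import message_from_bytes

RETENTION_SECONDS = 7 * 24 * 3600  # hypothetical: keep pristine copies for one week


def open_store(path=":memory:"):
    """Open (or create) the store of pristine copies."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS originals "
        "(message_id TEXT PRIMARY KEY, received REAL, raw BLOB)"
    )
    return db


def message_id(raw_bytes):
    """Return the Message-ID header value, or None if the mail has none."""
    value = message_from_bytes(raw_bytes).get("Message-ID")
    return str(value).strip() if value else None


def remember(db, raw_bytes, now=None):
    """Store a mail exactly as it arrived, before any headers are added."""
    mid = message_id(raw_bytes)
    if mid is not None:
        stamp = time.time() if now is None else now
        db.execute("INSERT OR REPLACE INTO originals VALUES (?, ?, ?)",
                   (mid, stamp, raw_bytes))
        db.commit()


def expire(db, now=None):
    """Drop copies older than the retention period."""
    cutoff = (time.time() if now is None else now) - RETENTION_SECONDS
    db.execute("DELETE FROM originals WHERE received < ?", (cutoff,))
    db.commit()


def recover(db, reported_bytes):
    """Return the pristine copy matching a reported mail, or None if it has expired."""
    mid = message_id(reported_bytes)
    if mid is None:
        return None
    row = db.execute("SELECT raw FROM originals WHERE message_id = ?",
                     (mid,)).fetchone()
    return row[0] if row else None
```

When `recover` returns None, the report arrived after the copy expired, which is exactly the gap described above.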

The other option sounds more viable. I would only need to strip off the
X-Scanned-By, X-Spam-* and X-Sanitized headers (which are ignored in my
setup for Bayes anyhow), BUT I have no guarantee that the message is in its
original format. Some MIME boundary rewriting may be done by the mail server
(where necessary), as is converting 8bit to 7bit where possible. And I think
that there are many client-side mail filtering engines, spam scanners and
virus scanners out there that may do some rewriting as well.
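
Stripping those headers without otherwise touching the message could look roughly like the sketch below. It works on the raw bytes rather than re-serializing through a mail library, so nothing else gets refolded or re-encoded. The function name and the pattern list are assumptions for this example; the patterns are simply the three header names mentioned above.

```python
import fnmatch

# Lower-case header-name patterns to drop (taken from the paragraph above).
STRIP_PATTERNS = ("x-scanned-by", "x-spam-*", "x-sanitized")


def strip_local_headers(raw_bytes, patterns=STRIP_PATTERNS):
    """Remove matching headers (including folded continuation lines) from a raw mail."""
    kept = []
    in_headers = True
    dropping = False
    for line in raw_bytes.splitlines(keepends=True):
        if in_headers:
            if line in (b"\n", b"\r\n"):
                in_headers = False          # the first blank line ends the header block
            elif line[:1] in (b" ", b"\t"):
                if dropping:
                    continue                # continuation of a header being dropped
            else:
                name = line.split(b":", 1)[0].decode("ascii", "replace").strip().lower()
                dropping = any(fnmatch.fnmatchcase(name, p) for p in patterns)
                if dropping:
                    continue
        kept.append(line)
    return b"".join(kept)
```

Everything after the header block is passed through unchanged, so a body line that happens to start with "X-Spam-" is left alone.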

Given the above, I'm not sure that training SpamAssassin on forwarded
messages that may or may not be in the same format in which SpamAssassin
first received them is a good idea. But I don't know enough about
SpamAssassin's internal workings and its Bayes filter to be
sure...

Kind Regards,
Sander Holthaus



RE: Manually training SpamAssassin by forwarding mail

2005-02-04 Thread Kevin Sullivan
--On 02/04/05 16:08:53 +0100 Sander Holthaus - Orange XL wrote:
> Basically, I've got two options. All received mail is backed up on
> the mail server before any headers are added, so I could match those
> copies against the mail received in the spam-learn and ham-learn
> accounts. However, mail is backed up only for a limited time before
> being moved, after which the mail server no longer has access to it. So
> unless people report mail that found its way through the filters on a
> very regular basis, it won't be a foolproof solution.

You don't really need a 100% solution; something which works 80% of the 
time would probably be fine.  But you may not want to do the programming 
needed to automate this.

> The other option sounds more viable. I would only need to strip off the
> X-Scanned-By, X-Spam-* and X-Sanitized headers (which are ignored in my
> setup for Bayes anyhow), BUT I have no guarantee that the message is in
> its original format. Some MIME boundary rewriting may be done by the
> mail server (where necessary), as is converting 8bit to 7bit where
> possible. And I think that there are many client-side mail filtering
> engines, spam scanners and virus scanners out there that may do some
> rewriting as well.

You'll probably find that the various changes don't affect Bayes that much.
When a rewritten message is learned, you may make Bayes miss email which
(in an ideal world) it would have caught, but I think it will tend to
classify such messages around 50% ("I don't know if this is ham or spam")
rather than classifying them incorrectly.  And there should be enough
unchanged tokens in the messages to let Bayes work anyway.

So I say strip off what you can but don't obsess about the rest.  Feed it 
into bayes and see how it works, and only try to fix it if you see bayes 
misclassifying email.

-Kevin





RE: Manually training SpamAssassin by forwarding mail

2005-02-04 Thread Sander Holthaus - Orange XL
> --On 02/04/05 16:08:53 +0100 Sander Holthaus - Orange XL wrote:
> > Basically, I've got two options. All received mail is backed up
> > on the mail server before any headers are added, so I could match
> > those copies against the mail received in the spam-learn and
> > ham-learn accounts. However, mail is backed up only for a limited
> > time before being moved, after which the mail server no longer has
> > access to it. So unless people report mail that found its way
> > through the filters on a very regular basis, it won't be a
> > foolproof solution.
>
> You don't really need a 100% solution; something which works
> 80% of the time would probably be fine.  But you may not want
> to do the programming needed to automate this.

I don't have the time for it yet, but I should be able to make something in
Perl. Personally, I'm no big fan of the 80% rule in programming, as that last
undone 20% usually accounts for 80% of my problems :-)
 
> > The other option sounds more viable. I would only need to strip off
> > the X-Scanned-By, X-Spam-* and X-Sanitized headers (which are ignored
> > in my setup for Bayes anyhow), BUT I have no guarantee that the
> > message is in its original format. Some MIME boundary rewriting may
> > be done by the mail server (where necessary), as is converting 8bit
> > to 7bit where possible. And I think that there are many client-side
> > mail filtering engines, spam scanners and virus scanners out there
> > that may do some rewriting as well.
>
> You'll probably find that the various changes don't affect
> Bayes that much.
> When a rewritten message is learned, you may make Bayes miss
> email which (in an ideal world) it would have caught, but I
> think it will tend to classify such messages around 50% ("I don't
> know if this is ham or spam") rather than classifying them
> incorrectly.  And there should be enough unchanged tokens in
> the messages to let Bayes work anyway.
>
> So I say strip off what you can but don't obsess about the
> rest.  Feed it into Bayes and see how it works, and only try
> to fix it if you see Bayes misclassifying email.

I'm not sure I know of a good way to check whether Bayes is
misclassifying, but I should be able to get some of that information from the
log files. Perhaps throwing away mail that has been rewritten/reformatted
would be a solution, though I don't know if such mail can be recognized easily.
We'll see :-)

Thanks for all the help and suggestions!

Kind Regards,
Sander Holthaus



Re: Manually training SpamAssassin by forwarding mail

2005-02-04 Thread Stuart Johnston
Peter Marshall wrote:
> Kevin Sullivan wrote:
> > --On 02/03/05 01:59:21 +0100 Sander Holthaus - Orange XL wrote:
> > > I've been interested in letting customers manually train the
> > > SpamAssassin Bayes filter on ham and spam (to reduce false
> > > positives and negatives). However, I can only find documentation
> > > on this for local mailboxes and IMAP. Most users, however, retrieve
> > > their mail through POP and use Outlook (Express) as their mail
> > > client. Is there a way to train SpamAssassin with such a setup
> > > (e.g. forwarding mail with Outlook (Express) using SMTP)?
> >
> > If you want to do a lot of programming, you could save all incoming
> > messages for a few days in a database somewhere.  When a user forwards
> > a message to a special "ham" or "spam" mailbox, you pull the
> > Message-ID from the message and use it to recover the original
> > message from your database.
> >
> > -Kevin
>
> My question is the same as Henrik's. I have a bunch of email that is spam
> (either tagged by SpamAssassin or not tagged at all). I forwarded it as
> an attachment to a spam mailbox. What do I have to do now before I
> can get Bayes to learn the messages? I read that you have to remove the
> headers. Could anyone give me a little more detail?

I use a modified version of the DMZS-sa-learn.pl from: 
http://www.dmzs.com/tools/files/spam.phtml

When someone forwards a spam to me, I move the message to a special imap 
folder that gets processed by the script.  My additions look something like:

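use Email::MIME;
...
# Parse the forwarded mail (the wrapper message sent by the user)
my $msg = Email::MIME->new($raw_message_body);
my @parts = $msg->parts;
foreach (@parts) {
  # Each spam forwarded as an attachment arrives as a message/rfc822 part
  if ($_->content_type =~ m|message/rfc822|) {
    sa_learn($_->body_raw);
  }
}
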
I've tested this with messages forwarded as attachments from Outlook and
Thunderbird.  I'm not sure how effective it is, though.  I'm sure that it
still loses something in the translation.  All-IMAP is really the way
to go if you can.

Stuart Johnston


RE: Manually training SpamAssassin by forwarding mail

2005-02-04 Thread Sander Holthaus - Orange XL
 


Would it be an idea to strip the Delivered-To header from the message, as
it carries no information for distinguishing ham from spam?

Also, I was wondering whether anybody who is using spam-learn and ham-learn
addresses has any protection built in to stop non-system users from mailing
to those addresses?

Kind Regards,
Sander Holthaus



Re: Manually training SpamAssassin by forwarding mail

2005-02-04 Thread Stuart Johnston
Peter Marshall wrote:
> But I have no IMAP, only POP. Users would forward (as an attachment) to
> a mailbox, and then I have to run sa-learn ... I assume as root?
>
> Will the stuff you posted work for this setup as well?
> Would there be big problems just running it on the message after it has
> been forwarded as an attachment?

The code I posted only shows how you can extract the attached spam from 
the email.  You'll need to write your own code to integrate it into your 
particular setup.

BTW, in Outlook, you can easily attach multiple spams to one message and 
this code should handle it.

> Can users also forward (as an attachment) mail that was already
> marked as spam, and is there any advantage to this?

If you use Bayes auto learn, I suspect that this wouldn't do much. 
Otherwise, it might help.

Stuart Johnston


RE: Manually training SpamAssassin by forwarding mail

2005-02-04 Thread Sander Holthaus - Orange XL
 

 -Original Message-
 From: Stuart Johnston [mailto:[EMAIL PROTECTED] 
 Sent: Friday, February 04, 2005 7:35 PM
 To: Peter Marshall; SpamAssassin Users
 Subject: Re: Manually training SpamAssassin by forwarding mail
 
 The code I posted only shows how you can extract the attached 
 spam from the email.  You'll need to write your own code to 
 integrate it into your particular setup.
 
 BTW, in Outlook, you can easily attach multiple spams to one 
 message and this code should handle it.

CTRL-a, right click, Forward Items will indeed do the trick.

  
  Can users also forward (as an attachment) mail that was already
  marked as spam, and is there any advantage to this?
 
 If you use Bayes auto learn, I suspect that this wouldn't do much. 
 Otherwise, it might help.

I would check the headers of the forwarded messages to see if their
spam score is above your auto-learning threshold. If it is, relearning them
is probably pointless. You might also wonder why the users received the
message at all (I would think that something that is good enough to
autolearn is good enough to refuse or discard).
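
A sketch of such a check, assuming the forwarded message still carries the X-Spam-Status header that SpamAssassin added. The threshold of 12.0 is an assumption and should match whatever `bayes_auto_learn_threshold_spam` is set to locally; the function name is made up. The regular expressions accept both `score=` and the older `hits=` spelling. A score above the threshold does not by itself prove the message was auto-learned, so the `autolearn=` field is checked first when present.

```python
import re
from email import message_from_bytes

# Assumed value; keep it in step with bayes_auto_learn_threshold_spam.
AUTOLEARN_SPAM_THRESHOLD = 12.0

_SCORE = re.compile(r"\b(?:score|hits)=(-?\d+(?:\.\d+)?)")
_AUTOLEARN = re.compile(r"\bautolearn=(\w+)")


def worth_relearning_as_spam(raw_bytes, threshold=AUTOLEARN_SPAM_THRESHOLD):
    """Return False when the mail was (probably) already auto-learned as spam."""
    status = message_from_bytes(raw_bytes).get("X-Spam-Status")
    if status is None:
        return True                      # never scanned, or the header was stripped
    status = " ".join(str(status).split())  # unfold a multi-line header
    learned = _AUTOLEARN.search(status)
    if learned and learned.group(1) == "spam":
        return False                     # SpamAssassin says it already learned this one
    score = _SCORE.search(status)
    if score is None:
        return True                      # no usable score, so learn it to be safe
    return float(score.group(1)) < threshold
```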

Kind Regards,
Sander Holthaus



RE: Manually training SpamAssassin by forwarding mail

2005-02-04 Thread Joe Polk
First, I had understood that Bayes can learn previously tagged emails without
stripping the SpamAssassin tags. Has this changed?

Second, all of my users use a webmail client, though they can use OE if they
wish. It is probably best for them to use IMAP so that server-side scanning
can be set up more easily. I currently have two scripts that run nightly. The
first takes everything in the user's /home/user/mail/Spam folder, learns it
as spam, then empties the folder. The second does the same for Ham, but moves
that mail to a Cleaned folder. All the user has to do is move untagged spam
into Spam and false positives into Ham.
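
The two nightly jobs described above might be sketched as follows. This is not the actual script, only an illustration under assumptions: the folders are taken to be single mbox files (hence `--mbox`), nothing else is writing to them while the job runs (no locking is done), and the function names are invented. The `run` parameter exists so the command can be swapped out for a dry run.

```python
import subprocess
from pathlib import Path


def learn_command(kind, mbox_path):
    """Build the sa-learn command line for one mbox file."""
    if kind not in ("spam", "ham"):
        raise ValueError("kind must be 'spam' or 'ham'")
    return ["sa-learn", "--" + kind, "--mbox", str(mbox_path)]


def nightly_run(home, run=subprocess.run):
    """Learn and empty Spam; learn Ham and move its contents to Cleaned."""
    mail = Path(home) / "mail"
    spam, ham, cleaned = mail / "Spam", mail / "Ham", mail / "Cleaned"

    if spam.exists() and spam.stat().st_size > 0:
        run(learn_command("spam", spam), check=True)
        spam.write_bytes(b"")                 # empty the Spam folder

    if ham.exists() and ham.stat().st_size > 0:
        run(learn_command("ham", ham), check=True)
        with cleaned.open("ab") as out:       # append the learned ham to Cleaned
            out.write(ham.read_bytes())
        ham.write_bytes(b"")                  # then empty the Ham folder
```

Calling `nightly_run("/home/user")` from cron once a night would match the description; passing `run=print`-style stand-ins is a cheap way to see what would be executed first.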

--
JAV





RE: Manually training SpamAssassin by forwarding mail

2005-02-03 Thread Sander Holthaus - Orange XL
At 07:59 PM 2/2/2005, Sander Holthaus - Orange XL wrote:
> I've been interested in letting customers manually train the
> SpamAssassin Bayes filter on ham and spam (to reduce false positives
> and negatives). However, I can only find documentation on this for
> local mailboxes and IMAP. Most users, however, retrieve their mail
> through POP and use Outlook (Express) as their mail client. Is there a
> way to train SpamAssassin with such a setup (e.g. forwarding mail with
> Outlook (Express) using SMTP)?
 

Matt Kettler wrote:
 
> Only if you can somehow get the users to forward an unmangled message,
> complete with original headers, as an attachment. You can then have a
> script strip off the attachments and feed those to sa-learn.
>
> The fundamental problem with normal forwarding is that, from an SA
> perspective, the forwarded message looks very little like the original:
> new headers, different encoding, extra text often added to the body,
> superfluous MIME sections dropped, others added.
>
> Since SA learns from the message headers, and some of the message
> encoding has an impact on learning, these changes cause problems.


Will Yardley wrote:
> There are various schemes to do this; the tricky part is
> getting people to submit emails in a consistent format. If
> you can get them to forward them as message/rfc822
> attachments, it probably wouldn't be too hard to write a
> program to extract them and train... I imagine this would be
> too complicated for many users, though.
>
> One scheme that we've used is to have specially named IMAP
> folders that users can place misclassified emails in for
> training. Then you can have a server-side robot which trains
> the filter and then discards the emails.


Thanks, I figured that that would be the problem. It makes it pretty hard, if
not impossible, to create such a system for average users. I was hoping that
SpamAssassin would include a system similar to DSPAM.

On the side, if I did get such a system working (where users are able to
forward emails untouched and I am able to extract those messages for
sa-learn), could I expect problems with some locally added headers? For
instance, headers added when the message passes through a local anti-spam or
anti-virus proxy. Or, in the case of IMAP, when users flag messages (or when
they are automatically flagged) before moving them to a learn-ham / learn-spam
folder?

Kind Regards,
Sander Holthaus