Re: Feeding SA-learn

2008-01-21 Thread Jari Fredriksson
> Hey list,
> 
> Can I feed a plain text file representing just the body
> of a message to sa-learn?
> 
> /Diego

Yes you can. Who's to stop you?

I just sent your message body as --ham, and it told me it had learned one message.




Re: Feeding SA-learn

2008-01-21 Thread Diego Pomatta

Jari Fredriksson wrote:
>> Hey list,
>>
>> Can I feed a plain text file representing just the body
>> of a message to sa-learn?
>>
>> /Diego
>
> Yes you can. Who's to stop you?
>
> I just sent your message body as --ham, and it told me it had learned one message.

I meant without the headers, just the body.
ok thanks


Re: Feeding SA-learn

2008-01-21 Thread Anthony Peacock

Diego Pomatta wrote:
> Jari Fredriksson wrote:
>>> Hey list,
>>>
>>> Can I feed a plain text file representing just the body
>>> of a message to sa-learn?
>>>
>>> /Diego
>>
>> Yes you can. Who's to stop you?
>>
>> I just sent your message body as --ham, and it told me it had learned one
>> message.
>
> I meant without the headers, just the body.
> ok thanks



Well the short answer is, yes you can.

The slightly longer answer is that you won't get as good results doing 
this, as the Bayes system uses tokens found in the complete message.  By 
only learning on the body you will not gain any advantage for tokens 
found in headers.
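
To make concrete what a body-only file leaves out, here is a minimal sketch using Python's standard `email` module. The sample message and the helper name `split_message` are invented for illustration; the sketch only separates header fields from the body and says nothing about how SpamAssassin itself tokenizes them.

```python
from email import message_from_string

# Invented sample message, for illustration only.
SAMPLE = (
    "Received: from mail.example.org by mx.example.net; Mon, 21 Jan 2008 10:00:00 +0000\n"
    "From: Someone <someone@example.org>\n"
    "Subject: Cheap watches\n"
    "Content-Type: text/plain\n"
    "\n"
    "Buy now!\n"
)


def split_message(raw):
    """Return (header names, body text) for a single-part raw message."""
    msg = message_from_string(raw)
    return msg.keys(), msg.get_payload()


headers, body = split_message(SAMPLE)
# Everything listed in `headers` is missing from a body-only file.
print("Header fields:", headers)
print("Body:", body)
```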



--
Anthony Peacock
CHIME, Royal Free & University College Medical School
WWW:http://www.chime.ucl.ac.uk/~rmhiajp/
"A CAT scan should take less time than a PET scan.  For a CAT scan,
 they're only looking for one thing, whereas a PET scan could result in
 a lot of things."- Carl Princi, 2002/07/19


Re: Feeding SA-learn

2008-01-23 Thread Diego Pomatta

Anthony Peacock wrote:
>>>> Can I feed a plain text file representing just the body
>>>> of a message to sa-learn?
>>>>
>>>> /Diego
>>>
>>> Yes you can. Who's to stop you?
>>>
>>> I just sent your message body as --ham, and it told me it had learned one
>>> message.
>>
>> I meant without the headers, just the body.
>> ok thanks
>
> Well the short answer is, yes you can.
>
> The slightly longer answer is that you won't get as good results doing
> this, as the Bayes system uses tokens found in the complete message.
> By only learning on the body you will not gain any advantage for
> tokens found in headers.

Yep, I know, precisely the problem is that I don't have the original 
headers after the mail has been delivered.
My intention was to manually feed the few spam messages that slip thru 
undetected. By the time I get a hold of those, they are in the 
recipient's mail client inbox, not in the server.
I was thinking, if I save the mail as EML files, would that preserve the 
headers in a way that sa-learn can parse correctly?


Thanks
/Diego


Re: Feeding SA-learn

2008-01-23 Thread Anthony Peacock

Diego Pomatta wrote:
> Anthony Peacock wrote:
>>>>> Can I feed a plain text file representing just the body
>>>>> of a message to sa-learn?
>>>>>
>>>>> /Diego
>>>>
>>>> Yes you can. Who's to stop you?
>>>>
>>>> I just sent your message body as --ham, and it told me it had learned one
>>>> message.
>>>
>>> I meant without the headers, just the body.
>>> ok thanks
>>
>> Well the short answer is, yes you can.
>>
>> The slightly longer answer is that you won't get as good results doing
>> this, as the Bayes system uses tokens found in the complete message.
>> By only learning on the body you will not gain any advantage for
>> tokens found in headers.
>
> Yep, I know, precisely the problem is that I don't have the original
> headers after the mail has been delivered.
> My intention was to manually feed the few spam messages that slip thru
> undetected. By the time I get a hold of those, they are in the
> recipient's mail client inbox, not in the server.
> I was thinking, if I save the mail as EML files, would that preserve the
> headers in a way that sa-learn can parse correctly?



Depends on the client.

For instance, Thunderbird stores its folders in mbox format, so
sa-learn can work against those files as-is.  Other email clients can
save emails in text format complete with headers.


The biggest problem with this is training the users to do that consistently.
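
One way to check whether a file saved by a mail client still carries its headers is to parse it and look for a few fields that a delivered message would normally have. The sketch below uses Python's standard `email` module; the function name `missing_headers` and the list of expected header names are assumptions chosen for illustration, not anything sa-learn requires.

```python
from email import message_from_string

# Header names picked for illustration; adjust to taste.
EXPECTED_HEADERS = ("Received", "From", "Date", "Message-ID")


def missing_headers(raw_text, expected=EXPECTED_HEADERS):
    """Return the expected header names that are absent from raw_text."""
    msg = message_from_string(raw_text)
    # Header lookup is case-insensitive and yields None when a field is absent.
    return [name for name in expected if msg[name] is None]


def looks_complete(path):
    """True if the saved file at `path` has all of the expected headers."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return not missing_headers(handle.read())
```

A file for which `looks_complete` returns False was probably saved as body text only, so learning from it would miss the header tokens discussed earlier in the thread.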


--
Anthony Peacock


Re: Feeding SA-learn

2008-01-23 Thread Diego Pomatta

Anthony Peacock wrote:
>>> Well the short answer is, yes you can.
>>>
>>> The slightly longer answer is that you won't get as good results
>>> doing this, as the Bayes system uses tokens found in the complete
>>> message.  By only learning on the body you will not gain any
>>> advantage for tokens found in headers.
>>
>> Yep, I know, precisely the problem is that I don't have the original
>> headers after the mail has been delivered.
>> My intention was to manually feed the few spam messages that slip
>> thru undetected. By the time I get a hold of those, they are in the
>> recipient's mail client inbox, not in the server.
>> I was thinking, if I save the mail as EML files, would that preserve
>> the headers in a way that sa-learn can parse correctly?
>
> Depends on the client.
>
> For instance, Thunderbird stores its folders in mbox format, so
> sa-learn can work against those files as-is. Other email clients can
> save emails in text format complete with headers.

I use Thunderbird. There are two files for that folder: Junk.msf (7k)
and Junk (53.172k). The msf file must be some kind of index. I just feed
the biggest one to sa-learn?

/Regards


Re: Feeding SA-learn

2008-01-23 Thread Anthony Peacock

Diego Pomatta wrote:
> Anthony Peacock wrote:
>>>> Well the short answer is, yes you can.
>>>>
>>>> The slightly longer answer is that you won't get as good results
>>>> doing this, as the Bayes system uses tokens found in the complete
>>>> message.  By only learning on the body you will not gain any
>>>> advantage for tokens found in headers.
>>>
>>> Yep, I know, precisely the problem is that I don't have the original
>>> headers after the mail has been delivered.
>>> My intention was to manually feed the few spam messages that slip
>>> thru undetected. By the time I get a hold of those, they are in the
>>> recipient's mail client inbox, not in the server.
>>> I was thinking, if I save the mail as EML files, would that preserve
>>> the headers in a way that sa-learn can parse correctly?
>>
>> Depends on the client.
>>
>> For instance, Thunderbird stores its folders in mbox format, so
>> sa-learn can work against those files as-is. Other email clients can
>> save emails in text format complete with headers.
>
> I use Thunderbird. There are two files for that folder: Junk.msf (7k)
> and Junk (53.172k). The msf file must be some kind of index. I just feed
> the biggest one to sa-learn?


Yes, the .msf file is an index file.  I just copy the mbox file (Junk in 
your case) to the server and run the following command specifying the 
filename (as shown):


/usr/local/bin/spamassassin --report --mbox Junk



--
Anthony Peacock


Re: Feeding SA-learn

2008-01-23 Thread Mark Johnson

>>> Depends on the client.
>>>
>>> For instance, Thunderbird stores its folders in mbox format, so
>>> sa-learn can work against those files as-is. Other email clients can
>>> save emails in text format complete with headers.
>>
>> I use Thunderbird. There are two files for that folder: Junk.msf (7k)
>> and Junk (53.172k). The msf file must be some kind of index. I just
>> feed the biggest one to sa-learn?
>
> Yes, the .msf file is an index file.  I just copy the mbox file (Junk in
> your case) to the server and run the following command specifying the
> filename (as shown):
>
> /usr/local/bin/spamassassin --report --mbox Junk



I use Thunderbird as my mail client but have found that I needed to use 
Evolution to save the messages in mbox format, which was always a hassle.


My emails are stored on an IMAP server and what you suggested wasn't 
working for me.  I had the .msf file, but no corresponding mbox file. 
Because the emails are kept on the IMAP server and are not local, I had 
to enable the "Select this folder for offline use" on the "Offline" tab 
of the folder properties.  I then had the mbox file that I could copy off.


--
Mark Johnson
http://www.astroshapes.com/information-technology/blog/



Re: Feeding SA-learn

2008-01-23 Thread Anthony Peacock

Mark Johnson wrote:
>>>> Depends on the client.
>>>>
>>>> For instance, Thunderbird stores its folders in mbox format, so
>>>> sa-learn can work against those files as-is. Other email clients can
>>>> save emails in text format complete with headers.
>>>
>>> I use Thunderbird. There are two files for that folder: Junk.msf (7k)
>>> and Junk (53.172k). The msf file must be some kind of index. I just
>>> feed the biggest one to sa-learn?
>>
>> Yes, the .msf file is an index file.  I just copy the mbox file (Junk
>> in your case) to the server and run the following command specifying
>> the filename (as shown):
>>
>> /usr/local/bin/spamassassin --report --mbox Junk
>
> I use Thunderbird as my mail client but have found that I needed to use
> Evolution to save the messages in mbox format, which was always a hassle.
>
> My emails are stored on an IMAP server and what you suggested wasn't
> working for me.  I had the .msf file, but no corresponding mbox file.
> Because the emails are kept on the IMAP server and are not local, I had
> to enable the "Select this folder for offline use" on the "Offline" tab
> of the folder properties.  I then had the mbox file that I could copy off.


Good point, I use this on folders that are saved on the local hard disk.

--
Anthony Peacock


Re: Feeding SA-learn

2008-01-23 Thread John Thompson
On 2008-01-23, Anthony Peacock <[EMAIL PROTECTED]> wrote:

>> My intention was to manually feed the few spam messages that slip thru 
>> undetected. By the time I get a hold of those, they are in the 
>> recipient's mail client inbox, not in the server.
>> I was thinking, if I save the mail as EML files, would that preserve the 
>> headers in a way that sa-learn can parse correctly?

> Depends on the client.
>
> For instance, Thunderbird stores its folders in mbox format, so
> sa-learn can work against those files as-is.  Other email clients can
> save emails in text format complete with headers.
>
> The biggest problem with this is training the users to do that consistently.

Isn't that what "cron" is for? :-)

I have a cron job on my imap server to regularly feed ham and spam 
through sa-learn.

-- 

John ([EMAIL PROTECTED])



Re: Feeding SA-learn

2008-01-23 Thread John Thompson
On 2008-01-23, Diego Pomatta <[EMAIL PROTECTED]> wrote:

> I use Thunderbird. There are two files for that folder: Junk.msf (7k) 
> and Junk (53.172k). The msf file must be some kind of index. I just feed 
> the biggest one to sa-learn?

Yup. Use "sa-learn --spam --mbox Junk" to learn your spam. You'll want 
to use the "--mbox" switch so sa-learn will process it as an mbox format 
mailbox, since that's what Thunderbird uses to store mail.
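
Before handing a copied folder file to sa-learn, it can be worth confirming that it really parses as an mbox and seeing how many messages it holds. The following is a rough sketch using Python's standard `mailbox` module; the helper names `count_messages`, `spam_command` and `learn_spam` are hypothetical, and the only sa-learn options used are the `--spam` and `--mbox` switches quoted above.

```python
import mailbox
import subprocess


def count_messages(mbox_path):
    """Count the messages in an existing mbox file without modifying it."""
    box = mailbox.mbox(mbox_path, create=False)
    try:
        return len(box)
    finally:
        box.close()


def spam_command(mbox_path, sa_learn="sa-learn"):
    """Build the invocation from the thread: sa-learn --spam --mbox <file>."""
    return [sa_learn, "--spam", "--mbox", mbox_path]


def learn_spam(mbox_path):
    """Run sa-learn on the mbox; assumes sa-learn is on the PATH."""
    print("Messages found:", count_messages(mbox_path))
    return subprocess.run(spam_command(mbox_path), check=True)
```

If the count printed here differs from the "message(s) examined" figure that sa-learn reports, the file is probably not being read as an mbox.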

-- 

John ([EMAIL PROTECTED])



Re: Feeding SA-learn

2008-01-23 Thread John Thompson
On 2008-01-23, Mark Johnson <[EMAIL PROTECTED]> wrote:

> I use Thunderbird as my mail client but have found that I needed to use 
> Evolution to save the messages in mbox format, which was always a hassle.

"mbox" is already the format in which Thunderbird stores mail. What was 
the problem that caused you to use Evolution?

> Because the emails are kept on the IMAP server and are not local, I had 
> to enable the "Select this folder for offline use" on the "Offline" tab 
> of the folder properties.  I then had the mbox file that I could copy off.

If you have shell access to the machine running the imap server, use a 
cron job on the server to feed your Junk into spamassassin. 

-- 

John ([EMAIL PROTECTED])



Re: Feeding SA-learn

2008-01-23 Thread Mark Johnson

John Thompson wrote:
> Isn't that what "cron" is for? :-)
>
> I have a cron job on my imap server to regularly feed ham and spam
> through sa-learn.




Do you delete the messages from the IMAP folder after you learn them? 
If so, how do you go about that?  I'm pretty sure if I deleted the mail 
files from the command line, I have to run a reconstruct on the mailbox 
or the folder throws errors on the client.  This is on a Cyrus IMAP server.


Thanks!

--
Mark Johnson
http://www.astroshapes.com/information-technology/blog/


Re: Feeding SA-learn

2008-01-24 Thread Anthony Peacock

John Thompson wrote:
> On 2008-01-23, Anthony Peacock <[EMAIL PROTECTED]> wrote:
>
>>> My intention was to manually feed the few spam messages that slip thru
>>> undetected. By the time I get a hold of those, they are in the
>>> recipient's mail client inbox, not in the server.
>>> I was thinking, if I save the mail as EML files, would that preserve the
>>> headers in a way that sa-learn can parse correctly?
>>
>> Depends on the client.
>>
>> For instance, Thunderbird stores its folders in mbox format, so
>> sa-learn can work against those files as-is.  Other email clients can
>> save emails in text format complete with headers.
>>
>> The biggest problem with this is training the users to do that consistently.
>
> Isn't that what "cron" is for? :-)
>
> I have a cron job on my imap server to regularly feed ham and spam
> through sa-learn.


I have a cron job that runs the learning process nightly.  I was 
referring to the process of gathering the false negatives and 
false positives.  That has to be done by hand, as a decision needs to be 
made about whether they are spam or not.  And, by definition, the 
automatic process has got it wrong.



--
Anthony Peacock


Re: Feeding SA-learn

2008-01-24 Thread Diego Pomatta

John Thompson wrote:
> On 2008-01-23, Diego Pomatta <[EMAIL PROTECTED]> wrote:
>
>> I use Thunderbird. There are two files for that folder: Junk.msf (7k)
>> and Junk (53.172k). The msf file must be some kind of index. I just feed
>> the biggest one to sa-learn?
>
> Yup. Use "sa-learn --spam --mbox Junk" to learn your spam. You'll want
> to use the "--mbox" switch so sa-learn will process it as an mbox format
> mailbox, since that's what Thunderbird uses to store mail.


  

~/sa-learn --spam --mbox Junk
Learned tokens from 7 message(s) (7 message(s) examined)

Looks like it worked feeding it the entire Thunderbird Junk folder file. :)
Thanks all.

Btw, what's the difference between using "sa-learn --spam..." and 
"spamassassin --report..." like Anthony said?


Regards


Re: Feeding SA-learn

2008-01-24 Thread Anthony Peacock

Diego Pomatta wrote:
> John Thompson wrote:
>> On 2008-01-23, Diego Pomatta <[EMAIL PROTECTED]> wrote:
>>
>>> I use Thunderbird. There are two files for that folder: Junk.msf (7k)
>>> and Junk (53.172k). The msf file must be some kind of index. I just
>>> feed the biggest one to sa-learn?
>>
>> Yup. Use "sa-learn --spam --mbox Junk" to learn your spam. You'll want
>> to use the "--mbox" switch so sa-learn will process it as an mbox
>> format mailbox, since that's what Thunderbird uses to store mail.
>
> ~/sa-learn --spam --mbox Junk
> Learned tokens from 7 message(s) (7 message(s) examined)
>
> Looks like it worked feeding it the entire Thunderbird Junk folder file. :)
> Thanks all.
>
> Btw, what's the difference between using "sa-learn --spam..." and
> "spamassassin --report..." like Anthony said?


From:

http://spamassassin.apache.org/full/3.2.x/doc/spamassassin-run.html

"-r, --report
Report this message as manually-verified spam. This will submit the 
mail message read from STDIN to various spam-blocker databases. 
Currently, these are the Distributed Checksum Clearinghouse 
http://www.rhyolite.com/anti-spam/dcc/, Pyzor 
http://pyzor.sourceforge.net/, Vipul's Razor 
http://razor.sourceforge.net/, and SpamCop http://www.spamcop.net/.


If the message contains SpamAssassin markup, the markup will be 
stripped out automatically before submission. The support modules for 
DCC, Pyzor, and Razor must be installed for spam to be reported to each 
service. SpamCop reports will have greater effect if you register and 
set the spamcop_to_address option.


The message will also be submitted to SpamAssassin's learning 
systems; currently this is the internal Bayesian statistical-filtering 
system (the BAYES rules). (Note that if you only want to perform 
statistical learning, and do not want to report mail to third-parties, 
you should use the sa-learn command directly instead.)"


This option teaches the Bayesian system, but also submits to third party 
systems like DCC and SpamCop.


--
Anthony Peacock


Re: Feeding SA-learn

2008-01-24 Thread John Thompson
On 2008-01-24, Anthony Peacock <[EMAIL PROTECTED]> wrote:

> John Thompson wrote:
>> 
>> Isn't that what "cron" is for? :-)
>> 
>> I have a cron job on my imap server to regularly feed ham and spam 
>> through sa-learn.

> I have a cron job that runs the learning process nightly.  I was 
> referring to the process of gathering the false negatives and 
> false positives.  That has to be done by hand, as a decision needs to be 
> made about whether they are spam or not.  And, by definition, the 
> automatic process has got it wrong.

Right. So I manually sort the false negatives/positives into their 
proper places (I don't usually get more than a couple a day) and let the 
cron job learn them later.
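
The workflow described here (sort by hand, let a scheduled job do the learning) could be scripted roughly as follows. This is only a sketch under stated assumptions: the folder paths are placeholders, the folders are assumed to be mbox files (a server that stores one file per message would need different arguments), and the function names are hypothetical. The only sa-learn options used are `--spam`, `--ham` and `--mbox`.

```python
import os
import subprocess

# Placeholder paths, not taken from the thread; point these at real mbox files.
FOLDERS = {
    "spam": "/path/to/Junk",
    "ham": "/path/to/Ham",
}


def build_commands(folders, sa_learn="sa-learn"):
    """Return one sa-learn command per folder, e.g. sa-learn --spam --mbox Junk."""
    return [[sa_learn, "--" + kind, "--mbox", path] for kind, path in folders.items()]


def main():
    for command in build_commands(FOLDERS):
        mbox_path = command[-1]
        # Quietly skip folders that do not exist yet.
        if not os.path.exists(mbox_path):
            continue
        subprocess.run(command, check=False)


if __name__ == "__main__":
    main()
```

A cron entry would then simply call this script at whatever interval suits; the manual sorting of misclassified mail still has to happen before each run.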

-- 

John ([EMAIL PROTECTED])



Re: Feeding SA-learn

2008-01-24 Thread John Thompson
On 2008-01-24, Mark Johnson <[EMAIL PROTECTED]> wrote:

> John Thompson wrote:
>
>> Isn't that what "cron" is for? :-)
>> 
>> I have a cron job on my imap server to regularly feed ham and spam 
>> through sa-learn.

> Do you delete the messages from the IMAP folder after you learn them? 
> If so, how do you go about that?  I'm pretty sure if I deleted the mail 
> files from the command line, I have to run a reconstruct on the mailbox 
> or the folder throws errors on the client.  This is on a Cyrus IMAP server.

No. I use Thunderbird and just set the Junk filter controls to expire 
junk messages after a couple weeks.

-- 

John ([EMAIL PROTECTED])



Re: Feeding SA-learn

2008-01-24 Thread Mark Johnson

John Thompson wrote:
> No. I use Thunderbird and just set the Junk filter controls to expire
> junk messages after a couple weeks.




Interesting idea!  Thanks for the tips!  You have no idea how much time 
and how many steps this is going to save me.


--
Mark Johnson
http://www.astroshapes.com/information-technology/blog