Re: Spamassassin not capturing obvious Spam

Antony Stone Tue, 31 May 2016 06:28:56 -0700

On Tuesday 31 May 2016 at 15:21:19, Shivram Krishnan wrote:

> Here is my scenario. I am using SA as a oracle/ground truth for a research
> project.


Okay.

> It is generally hard to get hold of a real time mail corpus

Er, what??

> I opted for a service provided by mailinator.

> I have also trained SA using sa-learn on known public corpuses like enron
> etc.

I'm assuming from "trained" that this means you're using Bayes.  Two comments:

1. Where are you getting the "ham" to train SA with?  It needs this as well 
as the "spam".

2. You should be aware (*especially* if using this stuff as the basis of a 
research project - any competent referee should pick up on something like 
this) that SA works best when the emails it is asked to process are from the 
same source as it has been trained with.  In other words, you shovel real 
emails through a real mail server and train SA using this spam and ham; you 
then use that trained SA to assess mail passing through that same mail server, 
for the same users.  Anything significantly varying from this is not going to 
work well, and is certainly not a good test of how well SA works.
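
As a rough illustration of the point above (a sketch only, not a tested 
recipe): take ham and spam collected from one and the same mail server, hold 
part of it back, train Bayes on the rest with sa-learn, and only then score 
the held-back part.  The helper names, the 80/20 split and the file names are 
assumptions made up for this example; the only SpamAssassin pieces relied on 
are sa-learn's --ham and --spam options.

```python
import random


def split_corpus(paths, train_fraction=0.8, seed=0):
    """Shuffle message paths reproducibly and split into (train, held_out).

    Hypothetical helper: both parts come from the same source, so the
    held-out mail resembles the mail Bayes was trained on.
    """
    shuffled = sorted(paths)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def sa_learn_command(label, message_paths):
    """Build an sa-learn command line for messages labelled 'ham' or 'spam'."""
    if label not in ("ham", "spam"):
        raise ValueError("label must be 'ham' or 'spam'")
    return ["sa-learn", "--" + label] + [str(p) for p in message_paths]


if __name__ == "__main__":
    # Placeholder file names; real ones would come from your own mail server.
    ham = ["ham/%d.eml" % i for i in range(10)]
    spam = ["spam/%d.eml" % i for i in range(10)]

    ham_train, ham_held_out = split_corpus(ham)
    spam_train, spam_held_out = split_corpus(spam)

    # Print the commands rather than running them, so nothing is trained
    # by accident; pass each list to subprocess.run(cmd, check=True) to
    # execute it for real.
    print(sa_learn_command("ham", ham_train))
    print(sa_learn_command("spam", spam_train))
```

The held-out messages are what you would then feed to SA for scoring; mail 
from a different source (mailinator, say) is a different population and tells 
you little about how well the trained Bayes database performs.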

> What do you guys suggest me to do in this case? Is there a better way to do
> it?

Yes, run a real mail server and process real emails.

Can you tell us anything more about what the research project is, for which 
you are using SA as an "oracle / ground truth"?


Antony.

-- 
It is also possible that putting the birds in a laboratory setting 
inadvertently renders them relatively incompetent.

 - Daniel C Dennett

                                                   Please reply to the list;
                                                         please *don't* CC me.
