Re: Spamassassin not capturing obvious Spam

2016-06-04 Thread Bill Cole

On 31 May 2016, at 2:18, Shivram Krishnan wrote:


It is not on production. I am using this to evaluate spamassassin.


That is entirely unnecessary and will break the autolearning subsystem 
if you have it enabled.


To get a full report of the rules hit and their scores, use the '-t' 
option with the spamassassin tool or, if you are using spamc, add the 
'-R' option.


Re: Spamassassin not capturing obvious Spam

2016-05-31 Thread Shivram Krishnan
Agreed that I do not have experience. I am just playing my cards out here
to get a corpus of mails.

Thanks guys!

On Tue, May 31, 2016 at 11:20 AM, Reindl Harald 
wrote:

>
>
> Am 31.05.2016 um 20:16 schrieb Antony Stone:
>
>> On Tuesday 31 May 2016 at 20:11:14, Shivram Krishnan wrote:
>>
>>> In the glue - like spamass-mailer, there would be two folders which are
>>> created. One would be the mailbox and the other would be a spambox(dont
>>> know the term). Cant you access the spambox to extract the mail?
>>>
>>
>> It sounds to me that you would benefit greatly from following one of the
>> several tutorials available to help you set up SpamAssassin on a mail
>> server,
>> and play with it for a while to understand better how it works and what is
>> possible
>>
>
> +1
>
> just writing "spamass-mailer" instead of "spamass-milter", having no clue what
> "glue" or "milter" means, and talking about folders shows zero experience
>
> experience doesn't come from asking a ton of questions, it comes from just
> doing things; it took me 2 months here to replace a commercial
> spamfilter-gateway from scratch
>
>


Re: Spamassassin not capturing obvious Spam

2016-05-31 Thread Reindl Harald



Am 31.05.2016 um 20:16 schrieb Antony Stone:

On Tuesday 31 May 2016 at 20:11:14, Shivram Krishnan wrote:


In the glue - like spamass-mailer, there would be two folders which are
created. One would be the mailbox and the other would be a spambox(dont
know the term). Cant you access the spambox to extract the mail?


It sounds to me that you would benefit greatly from following one of the
several tutorials available to help you set up SpamAssassin on a mail server,
and play with it for a while to understand better how it works and what is
possible


+1

just writing "spamass-mailer" instead of "spamass-milter", having no clue what 
"glue" or "milter" means, and talking about folders shows zero experience


experience doesn't come from asking a ton of questions, it comes from 
just doing things; it took me 2 months here to replace a commercial 
spamfilter-gateway from scratch




signature.asc
Description: OpenPGP digital signature


Re: Spamassassin not capturing obvious Spam

2016-05-31 Thread Shivram Krishnan
In the glue - like spamass-mailer, there would be two folders which are
created. One would be the mailbox and the other would be a spambox (I don't
know the term). Can't you access the spambox to extract the mail?

On Tue, May 31, 2016 at 11:01 AM, Reindl Harald 
wrote:

>
>
> Am 31.05.2016 um 19:55 schrieb Shivram Krishnan:
>
>> There will a point where the decision to drop the mail is made based on
>> the headers. Cant we log it there?
>>
>
> SA doesn't make any drop / reject decisions
> the glue does - spamass-milter, amavis or whatever
>
> and even if it did - i would find it perverse to do the logging in the glue
> instead of having the "spamd: result:" log-lines, which are already present
> and give you a lot of information (rules hit, score, message-id if
> available), extended with the envelope information
>
> never mind your use case - the fact is that you have two different log
> lines which each cover a small piece, or a lot of noise where most
> information is redundant
>
> IMHO it is a major bug not to include the envelopes (which are known to spamd
> anyway) in the existing logging
>
> On Tue, May 31, 2016 at 10:30 AM, Reindl Harald > > wrote:
>>
>>
>>
>> Am 31.05.2016 um 19:25 schrieb Shivram Krishnan:
>>
>> Thanks guys.
>>
>> What I am going to ask might be a longshot.
>>
>> But is it possible for anyone who is running a mailserver to
>> give a list
>> of source of SPAM (recent , anytime this year)and the SA score
>> associated? It will be extremely useful for my research and
>> credit would
>> be given. Example:-
>> efetunisie.org  ,6.3
>> abcxcf.com  ,5.7
>>
>>
>> the problem is that SpamAssassin doesn't log envelope addresses at all in
>> the "spamd: result:" lines, so for mails with a missing message-ID you
>> have nothing but this line to go on
>>
>> this was discussed here - nobody cares: "use a glue with its own logging
>> - bla"
>>
>
>


Re: Spamassassin not capturing obvious Spam

2016-05-31 Thread Shivram Krishnan
Hi Antony,

I have an ongoing collection of blacklists since Jan 1, 2016. This way I
would know how long an address has stayed on a blacklist.

"Dealing with email "after the event" (especially with regard to blacklists)
will give you very different results from dealing with it as it happens, if
for no other reason than the spam which you and lots of other mail admins
receive is the very trigger which causes the IP address to go onto the
blacklist."

This is the exact reason why I use Mailinator (which has some problems; refer
to previous mails in this thread): I get a live flow of mails. And I agree
that there is no point evaluating my study on after-the-event mails.

To evaluate the performance of my study I am using SA.



On Tue, May 31, 2016 at 10:44 AM, Antony Stone <
antony.st...@spamassassin.open.source.it> wrote:

> On Tuesday 31 May 2016 at 15:47:56, Shivram Krishnan wrote:
>
> > I am using SA as an oracle for Blacklisting. Our research concerns with
> > combining multiple sources of blacklist and also consider the historical
> > importance of an IP in a blacklist to create a very effective master
> > blacklist.
> >
> > Let me give you an example.
> > Suppose an IP address 1.2.3.4 appeared on Jan 1 ,2016 in Blacklist A.
> > 1.2.3.4 stayed on Blacklist A for about 12 hours.
> >
> > We have developed a system which assigns a score to 1.2.3.4. If the score
> > allocated to 1.2.3.4 is high we include it in our Master Blacklist.
> >
> > To evaluate the performance of the master Blacklist in terms of hitrate
> > and false positives we plan to use SA.
>
> How do you plan to find out when 1.2.3.4 appeared on the blacklist, and how
> long for, if you are not dealing with live mail flowing through a real mail
> server?
>
> Dealing with email "after the event" (especially with regard to blacklists)
> will give you very different results from dealing with it as it happens,
> if for no other reason than the spam which you and lots of other mail
> admins receive is the very trigger which causes the IP address to go onto
> the blacklist.
>
> I think you should try to show in advance that your methods, and what you
> are measuring, are a valid way of assessing spam from different addresses,
> in order for the study to be useful.
>
>
> Regards,
>
>
> Antony.
>
> --
> If you were ploughing a field, which would you rather use - two strong
> oxen or 1024 chickens?
>
>  - Seymour Cray, pioneer of supercomputing
>
> Please reply to the list;
> please *don't* CC me.
>


Re: Spamassassin not capturing obvious Spam

2016-05-31 Thread Antony Stone
On Tuesday 31 May 2016 at 15:47:56, Shivram Krishnan wrote:

> I am using SA as an oracle for Blacklisting. Our research concerns with
> combining multiple sources of blacklist and also consider the historical
> importance of an IP in a blacklist to create a very effective master
> blacklist.
> 
> Let me give you an example.
> Suppose an IP address 1.2.3.4 appeared on Jan 1 ,2016 in Blacklist A.
> 1.2.3.4 stayed on Blacklist A for about 12 hours.
> 
> We have developed a system which assigns a score to 1.2.3.4. If the score
> allocated to 1.2.3.4 is high we include it in our Master Blacklist.
> 
> To evaluate the performance of the master Blacklist in terms of hitrate and
> false positives we plan to use SA.

How do you plan to find out when 1.2.3.4 appeared on the blacklist, and how 
long for, if you are not dealing with live mail flowing through a real mail 
server?

Dealing with email "after the event" (especially with regard to blacklists) 
will give you very different results from dealing with it as it happens, if for 
no other reason than the spam which you and lots of other mail admins receive 
is the very trigger which causes the IP address to go onto the blacklist.

I think you should try to show in advance that your methods, and what you are 
measuring, are a valid way of assessing spam from different addresses, in order 
for the study to be useful.


Regards,


Antony.

-- 
If you were ploughing a field, which would you rather use - two strong oxen or 
1024 chickens?

 - Seymour Cray, pioneer of supercomputing

   Please reply to the list;
 please *don't* CC me.


Re: Spamassassin not capturing obvious Spam

2016-05-31 Thread Reindl Harald



Am 31.05.2016 um 19:25 schrieb Shivram Krishnan:

Thanks guys.

What I am going to ask might be a longshot.

But is it possible for anyone who is running a mailserver to give a list
of source of SPAM (recent , anytime this year)and the SA score
associated? It will be extremely useful for my research and credit would
be given. Example:-
efetunisie.org ,6.3
abcxcf.com ,5.7


the problem is that SpamAssassin doesn't log envelope addresses at all in the 
"spamd: result:" lines, so for mails with a missing message-ID you have 
nothing but this line to go on


this was discussed here - nobody cares: "use a glue with its own logging - bla"


You might think that there will be privacy issues, but I am asking only
for SPAM mails which would be filtered anyways. I need a large corpus of
mails for evaluating my technique.

On Tue, May 31, 2016 at 8:55 AM, Bowie Bailey > wrote:

On 5/31/2016 1:38 AM, @lbutlr wrote:

On May 30, 2016, at 11:06 PM, Shivram Krishnan
> wrote:

2) I have set a threshold of -10 to see how spamassassin
assigns a score for every mail.

No. Do not do this.


Instead, set this option in your local.cf  file:

add_header all Report _REPORT_

This will make SA add a report header to all emails so you can see
how they score.  As a plus, you will also see whether it's marked as
ham or spam, which you lose by artificially lowering the threshold.

You can also use this if you want more info on Bayes scoring (should
be all one line):

add_header all Bayes bayes=_BAYES_,
N=_BAYESTC_(_BAYESTCLEARNED_-_BAYESTCHAMMY_+_BAYESTCSPAMMY_),
ham=(_HAMMYTOKENS(5,short)_), spam=(_SPAMMYTOKENS(5,short)_)







Re: Spamassassin not capturing obvious Spam

2016-05-31 Thread Shivram Krishnan
Thanks guys.

What I am going to ask might be a longshot.

But is it possible for anyone who is running a mailserver to give a list of
sources of spam (recent, any time this year) and the associated SA scores? It
will be extremely useful for my research and credit would be given.
Example:
efetunisie.org,6.3
abcxcf.com,5.7
.
.
.


You might think that there will be privacy issues, but I am asking only for
SPAM mails which would be filtered anyways. I need a large corpus of mails
for evaluating my technique.

On Tue, May 31, 2016 at 8:55 AM, Bowie Bailey  wrote:

> On 5/31/2016 1:38 AM, @lbutlr wrote:
>
>> On May 30, 2016, at 11:06 PM, Shivram Krishnan 
>> wrote:
>>
>>> 2) I have set a threshold of -10 to see how spamassassin assigns a score
>>> for every mail.
>>>
>> No. Do not do this.
>>
>
> Instead, set this option in your local.cf file:
>
> add_header all Report _REPORT_
>
> This will make SA add a report header to all emails so you can see how
> they score.  As a plus, you will also see whether it's marked as ham or
> spam, which you lose by artificially lowering the threshold.
>
> You can also use this if you want more info on Bayes scoring (should be
> all one line):
>
> add_header all Bayes bayes=_BAYES_,
> N=_BAYESTC_(_BAYESTCLEARNED_-_BAYESTCHAMMY_+_BAYESTCSPAMMY_),
> ham=(_HAMMYTOKENS(5,short)_), spam=(_SPAMMYTOKENS(5,short)_)
>
> --
> Bowie
>
>


Re: Spamassassin not capturing obvious Spam

2016-05-31 Thread Bowie Bailey

On 5/31/2016 1:38 AM, @lbutlr wrote:

On May 30, 2016, at 11:06 PM, Shivram Krishnan  wrote:

2) I have set a threshold of -10 to see how spamassassin assigns a score for 
every mail.

No. Do not do this.


Instead, set this option in your local.cf file:

add_header all Report _REPORT_

This will make SA add a report header to all emails so you can see how 
they score.  As a plus, you will also see whether it's marked as ham or 
spam, which you lose by artificially lowering the threshold.
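
A minimal sketch, not part of the original reply, of how the score and verdict could be read back from a message once SA has tagged it. It assumes the message carries an `X-Spam-Status` header in the usual `Yes/No, score=... required=... tests=...` layout; the function name `parse_spam_status` and the sample message are made up for illustration.

```python
import email
import re

# Hypothetical sample; a real message would come out of spamassassin or spamc.
SAMPLE = (
    "X-Spam-Status: Yes, score=6.3 required=5.0 tests=BAYES_99,HTML_MESSAGE,\n"
    "\tURIBL_BLACK autolearn=no version=3.4.1\n"
    "From: sender@example.org\n"
    "Subject: test\n"
    "\n"
    "body\n"
)


def parse_spam_status(raw_message):
    """Return (is_spam, score, required, tests) or None if the header is missing."""
    msg = email.message_from_string(raw_message)
    status = msg.get("X-Spam-Status")
    if status is None:
        return None
    # Folded headers contain line breaks; collapse whitespace, then rejoin
    # the comma-separated rule list that folding may have split.
    status = re.sub(r",\s+", ",", " ".join(status.split()))
    match = re.match(
        r"(Yes|No),\s*score=(-?[\d.]+) required=(-?[\d.]+)(?: tests=(\S*))?",
        status,
    )
    if match is None:
        return None
    verdict, score, required, tests = match.groups()
    rules = [name for name in (tests or "").split(",") if name]
    return verdict == "Yes", float(score), float(required), rules


print(parse_spam_status(SAMPLE))
```

Because the verdict comes from the header rather than from a lowered threshold, the ham/spam marking stays meaningful.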


You can also use this if you want more info on Bayes scoring (should be 
all one line):


add_header all Bayes bayes=_BAYES_, N=_BAYESTC_(_BAYESTCLEARNED_-_BAYESTCHAMMY_+_BAYESTCSPAMMY_), ham=(_HAMMYTOKENS(5,short)_), spam=(_SPAMMYTOKENS(5,short)_)


--
Bowie



Re: Spamassassin not capturing obvious Spam

2016-05-31 Thread Reindl Harald



Am 31.05.2016 um 17:13 schrieb Shivram Krishnan:

I might be forced to do this. Take the corpus from Mailinator and
manually mark it as SPAM or HAM and use sa-learn to train spamassassin.

But this is what is confusing me. doesnt SA use a lot more tags, to
determine if it is a SPAM or HAM? does this mean that sa-learn is not
only for bayes but also for all the tags which get triggered in the mail?


sa-learn is *only* for bayes, but since you have no clean and 
uncrippled messages from there, you can't expect other rules to work 
properly, nor bayes trained with that data to work properly for real email



On Tue, May 31, 2016 at 8:07 AM, Antony Stone
> wrote:

On Tuesday 31 May 2016 at 17:02:26, Reindl Harald wrote:

> Am 31.05.2016 um 16:59 schrieb Antony Stone:
> >
> > I had read SA documentation such as
> > https://spamassassin.apache.org/full/3.1.x/doc/sa-learn.html

> that's all based on opinions - the only question is the quality of
> training, and i don't base my decisions and what i say on some opinions
> on a website but on a ton of accounts at both involved companies sharing
> a bayes database for inbound and outgoing mail

That's fair enough, but I think someone just starting out with SA,
or doing a
research project, or simply not handling the large quantity of email
that you
do (and able to put in the effort of hand-tuning which you appear to
do as
well) has to get their starting point from somewhere, and the
official project
website is something most people would regard as "good advice".

> well, with the defaults of auto-learning that opinions maybe are true

In which case maybe it's useful for the original poster after all.






Re: Spamassassin not capturing obvious Spam

2016-05-31 Thread Shivram Krishnan
I might be forced to do this: take the corpus from Mailinator, manually
mark it as SPAM or HAM and use sa-learn to train spamassassin.

But this is what is confusing me. Doesn't SA use a lot more tags to
determine whether a mail is SPAM or HAM? Does this mean that sa-learn is not
only for bayes but also for all the tags which get triggered in the mail?

On Tue, May 31, 2016 at 8:07 AM, Antony Stone <
antony.st...@spamassassin.open.source.it> wrote:

> On Tuesday 31 May 2016 at 17:02:26, Reindl Harald wrote:
>
> > Am 31.05.2016 um 16:59 schrieb Antony Stone:
> > >
> > > I had read SA documentation such as
> > > https://spamassassin.apache.org/full/3.1.x/doc/sa-learn.html
>
> > that's all based on opinions - the only question is the quality of
> > training, and i don't base my decisions and what i say on some opinions
> > on a website but on a ton of accounts at both involved companies sharing
> > a bayes database for inbound and outgoing mail
>
> That's fair enough, but I think someone just starting out with SA, or
> doing a
> research project, or simply not handling the large quantity of email that
> you
> do (and able to put in the effort of hand-tuning which you appear to do as
> well) has to get their starting point from somewhere, and the official
> project
> website is something most people would regard as "good advice".
>
> > well, with the defaults of auto-learning that opinions maybe are true
>
> In which case maybe it's useful for the original poster after all.
>
>
> Antony.
>
> --
> "In fact I wanted to be John Cleese and it took me some time to realise
> that the job was already taken."
>
>  - Douglas Adams
>
> Please reply to the list;
> please *don't* CC me.
>


Re: Spamassassin not capturing obvious Spam

2016-05-31 Thread Antony Stone
On Tuesday 31 May 2016 at 17:02:26, Reindl Harald wrote:

> Am 31.05.2016 um 16:59 schrieb Antony Stone:
> > 
> > I had read SA documentation such as
> > https://spamassassin.apache.org/full/3.1.x/doc/sa-learn.html

> that's all based on opinions - the only question is the quality of
> training, and i don't base my decisions and what i say on some opinions
> on a website but on a ton of accounts at both involved companies sharing
> a bayes database for inbound and outgoing mail

That's fair enough, but I think someone just starting out with SA, or doing a 
research project, or simply not handling the large quantity of email that you 
do (and able to put in the effort of hand-tuning which you appear to do as 
well) has to get their starting point from somewhere, and the official project 
website is something most people would regard as "good advice".

> well, with the defaults of auto-learning that opinions maybe are true

In which case maybe it's useful for the original poster after all.


Antony.

-- 
"In fact I wanted to be John Cleese and it took me some time to realise that 
the job was already taken."

 - Douglas Adams

   Please reply to the list;
 please *don't* CC me.


Re: Spamassassin not capturing obvious Spam

2016-05-31 Thread Reindl Harald


Am 31.05.2016 um 16:59 schrieb Antony Stone:

On Tuesday 31 May 2016 at 15:32:49, Reindl Harald wrote:


Am 31.05.2016 um 15:28 schrieb Antony Stone:

2. You should be aware (*especially* if using this stuff as the basis of
a research project - any competent referee should pick up on something
like this) that SA works best when the emails it is asked to process are
from the same source as it has been trained with.  In other words, you
shovel real emails through a real mail server and train SA using this
spam and ham; you then use that trained SA to assess mail passing through
that same mail server, for the same users.  Anything significantly
varying from this is not going to work well, and is certainly not a good
test of how well SA works.


not true - i heard similar nonsense about "you can't re-use you MX bayes
database on a submission server" - i can, do and it works like a charm


Oh!

I had read SA documentation such as
https://spamassassin.apache.org/full/3.1.x/doc/sa-learn.html which contains
comments such as:

"The pros of Bayesian spam analysis:
Can greatly reduce false positives and false negatives.
 - It learns from your mail, so it is tailored to your unique e-mail flow."

"You're urged to avoid using a publicly available corpus (sample) - this must
be taken from YOUR mail server, if it is to be statistically useful.
Otherwise, the results may be pretty skewed."

If this sort of advice is incorrect, maybe a request should be raised with the
SA developers to update the official documentation?


that's all based on opinions - the only question is the quality of 
training, and i don't base my decisions and what i say on some opinions 
on a website but on a ton of accounts at both involved companies sharing 
a bayes database for inbound and outgoing mail


well, with the defaults of auto-learning those opinions may be true





Re: Spamassassin not capturing obvious Spam

2016-05-31 Thread Antony Stone
On Tuesday 31 May 2016 at 15:32:49, Reindl Harald wrote:

> Am 31.05.2016 um 15:28 schrieb Antony Stone:
> > 2. You should be aware (*especially* if using this stuff as the basis of
> > a research project - any competent referee should pick up on something
> > like this) that SA works best when the emails it is asked to process are
> > from the same source as it has been trained with.  In other words, you
> > shovel real emails through a real mail server and train SA using this
> > spam and ham; you then use that trained SA to assess mail passing through
> > that same mail server, for the same users.  Anything significantly
> > varying from this is not going to work well, and is certainly not a good
> > test of how well SA works.
> 
> not true - i heard similar nonsense about "you can't re-use you MX bayes
> database on a submission server" - i can, do and it works like a charm

Oh!

I had read SA documentation such as
https://spamassassin.apache.org/full/3.1.x/doc/sa-learn.html which contains 
comments such as:

"The pros of Bayesian spam analysis:
Can greatly reduce false positives and false negatives.
 - It learns from your mail, so it is tailored to your unique e-mail flow."

"You're urged to avoid using a publicly available corpus (sample) - this must 
be taken from YOUR mail server, if it is to be statistically useful. 
Otherwise, the results may be pretty skewed."


If this sort of advice is incorrect, maybe a request should be raised with the 
SA developers to update the official documentation?


Antony.

-- 
If the human brain were so simple that we could understand it,
we'd be so simple that we couldn't.

   Please reply to the list;
 please *don't* CC me.


Re: Spamassassin not capturing obvious Spam

2016-05-31 Thread Shivram Krishnan
BTW I am using SA as an oracle for blacklisting. Our research is concerned
with combining multiple blacklist sources, and also with considering the
historical importance of an IP in a blacklist, to create a very effective
master blacklist.

Let me give you an example.
Suppose an IP address 1.2.3.4 appeared on Jan 1 ,2016 in Blacklist A.
1.2.3.4 stayed on Blacklist A for about 12 hours.

We have developed a system which assigns a score to 1.2.3.4. If the score
allocated to 1.2.3.4 is high we include it in our Master Blacklist.

To evaluate the performance of the master Blacklist in terms of hitrate and
false positives we plan to use SA.
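
A rough sketch of how such an evaluation could be tallied; it is not taken from the research described here. It assumes "hit rate" means the share of SA-flagged spam whose source IP is on the master blacklist, and "false positives" means the share of SA-classified ham whose source IP is on it. The function name `evaluate_blacklist` and all data are hypothetical.

```python
def evaluate_blacklist(blacklist, observations):
    """Compare a blacklist against SA verdicts.

    observations: iterable of (source_ip, sa_says_spam) pairs.
    Returns (hit_rate, false_positive_rate); a rate is None when its
    denominator is zero.
    """
    spam_total = spam_listed = ham_total = ham_listed = 0
    for ip, is_spam in observations:
        listed = ip in blacklist
        if is_spam:
            spam_total += 1
            spam_listed += listed
        else:
            ham_total += 1
            ham_listed += listed
    hit_rate = spam_listed / spam_total if spam_total else None
    fp_rate = ham_listed / ham_total if ham_total else None
    return hit_rate, fp_rate


# Made-up data using documentation address ranges.
master_blacklist = {"192.0.2.10", "192.0.2.11"}
observed = [
    ("192.0.2.10", True),
    ("192.0.2.11", True),
    ("198.51.100.5", True),
    ("192.0.2.10", False),
    ("203.0.113.7", False),
    ("203.0.113.8", False),
    ("203.0.113.9", False),
]
print(evaluate_blacklist(master_blacklist, observed))
```

Both rates are only as good as the SA verdicts they are measured against, so any misclassification by SA carries straight into the numbers.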

On Tue, May 31, 2016 at 6:43 AM, Shivram Krishnan 
wrote:

> The data set which i use for bayes consists of both ham and spam. (
> https://www.cs.cmu.edu/~./enron/)
>
> Lets consider a scenario, where I have a domain and I point it to a
> mailserver. It might take a while for me to generate 50,000 mails a day (
> mailinator provides me this) . I need to embed multiple mail ids into
> several forums for the web scrapers to pick it up.
>
> I have tried to get hold of mails from my university - but it is a long
> and tedious process.
>
> I can try the method which Reindl suggested.
>
>
>
> On Tue, May 31, 2016 at 6:32 AM, Reindl Harald 
> wrote:
>
>>
>>
>> Am 31.05.2016 um 15:28 schrieb Antony Stone:
>>
>>> 2. You should be aware (*especially* if using this stuff as the basis of
>>> a
>>> research project - any competent referee should pick up on something like
>>> this) that SA works best when the emails it is asked to process are from
>>> the
>>> same source as it has been trained with.  In other words, you shovel real
>>> emails through a real mail server and train SA using this spam and ham;
>>> you
>>> then use that trained SA to assess mail passing through that same mail
>>> server,
>>> for the same users.  Anything significantly varying from this is not
>>> going to
>>> work well, and is certainly not a good test of how well SA works.
>>>
>>
>> not true - i heard similar nonsense about "you can't re-use you MX bayes
>> database on a submission server" - i can, do and it works like a charm
>>
>> our current corpus is 9 mails large, conatins samples in many
>> languages for many users (site-wide setup) and that bayes is shared with
>> another company for more than a year now and has the same results there as
>> here (96% hit quote)
>>
>>
>


Re: Spamassassin not capturing obvious Spam

2016-05-31 Thread Shivram Krishnan
The data set which i use for bayes consists of both ham and spam. (
https://www.cs.cmu.edu/~./enron/)

Lets consider a scenario, where I have a domain and I point it to a
mailserver. It might take a while for me to generate 50,000 mails a day (
mailinator provides me this) . I need to embed multiple mail ids into
several forums for the web scrapers to pick it up.

I have tried to get hold of mails from my university - but it is a long and
tedious process.

I can try the method which Reindl suggested.



On Tue, May 31, 2016 at 6:32 AM, Reindl Harald 
wrote:

>
>
> Am 31.05.2016 um 15:28 schrieb Antony Stone:
>
>> 2. You should be aware (*especially* if using this stuff as the basis of a
>> research project - any competent referee should pick up on something like
>> this) that SA works best when the emails it is asked to process are from
>> the
>> same source as it has been trained with.  In other words, you shovel real
>> emails through a real mail server and train SA using this spam and ham;
>> you
>> then use that trained SA to assess mail passing through that same mail
>> server,
>> for the same users.  Anything significantly varying from this is not
>> going to
>> work well, and is certainly not a good test of how well SA works.
>>
>
> not true - i heard similar nonsense about "you can't re-use you MX bayes
> database on a submission server" - i can, do and it works like a charm
>
> our current corpus is 9 mails large, conatins samples in many
> languages for many users (site-wide setup) and that bayes is shared with
> another company for more than a year now and has the same results there as
> here (96% hit quote)
>
>


Re: Spamassassin not capturing obvious Spam

2016-05-31 Thread Reindl Harald



Am 31.05.2016 um 15:28 schrieb Antony Stone:

2. You should be aware (*especially* if using this stuff as the basis of a
research project - any competent referee should pick up on something like
this) that SA works best when the emails it is asked to process are from the
same source as it has been trained with.  In other words, you shovel real
emails through a real mail server and train SA using this spam and ham; you
then use that trained SA to assess mail passing through that same mail server,
for the same users.  Anything significantly varying from this is not going to
work well, and is certainly not a good test of how well SA works.


not true - i heard similar nonsense about "you can't re-use your MX bayes 
database on a submission server" - i can, i do and it works like a charm


our current corpus is 9 mails large, contains samples in many 
languages for many users (site-wide setup), and that bayes has been shared 
with another company for more than a year now and has the same results 
there as here (96% hit rate)






Re: Spamassassin not capturing obvious Spam

2016-05-31 Thread Reindl Harald



Am 31.05.2016 um 15:21 schrieb Shivram Krishnan:

Here is my scenario. I am using SA as a oracle/ground truth for a
research project. It is generally hard to get hold of a real time mail
corpus


nope, just point a cheap domain to a mailserver accepting all incoming 
stuff and spread some hidden mail-links to it



so I opted for a service provided by mailinator. Mailinator is a
company which provides users with disposable email ID's and it offers an
API to obtain the mails of the disposable ID's. Unfortunately it
provides the mail in JSON, and SA takes the mail in RFC 2822.


which is a problem - you can't seriously use any classification of 
something which would never appear that way in a real mailflow



I have written a script which converts JSON to RFC 2822 (though there
are a lot of specifications on the RFC 2822 , I managed to capture most
of them just so that SA has something to work with)


but the received headers are crap


I have also trained SA using sa-learn on known public corpuses like
enron etc.

I use SA, to classify the converted mails from Mailinator as HAM or SPAM.

for example, if a mail is stored in the text file mail.txt I run

spamassassin mail.txt

This returns the necessary score for me to decide if it is SPAM or not.

What do you guys suggest me to do in this case? Is there a better way to
do it?


FIRST strip out the newline at the beginning, which implies "end of headers", 
and at least generate usable Received headers


frankly i have no idea why bayes classification changes completely with 
no useful Received headers - i started to strip them all from our corpus 
with "formail" and got unpredictable and illogical results doing a 
bayes-masstest on the corpus


by just stripping any such header and adding a generic one at the beginning 
of the samples, things got predictable and as expected


since that day *all samples* have the header below, which also makes the 
bayes database more compressible (on a sane setup with no autoexpire the 
date doesn't matter at all)


Received: from mx.example.com (mx.example.com [91.119.73.19])
 for ; Mon, 9 May 2016 19:20:00 +0200 (CEST)
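
A hedged sketch of the clean-up described above: drop a leading blank line, remove every existing Received header and prepend a single generic one. This is not the `formail` procedure used on the corpus mentioned here, only an illustration in Python; the "by" host, the recipient address and the IP in `GENERIC_RECEIVED` are placeholders, and `normalize_sample` is a made-up name.

```python
import email

# Hypothetical generic header in the spirit of the example above; the hosts,
# IP address and recipient are placeholders, not values from a real server.
GENERIC_RECEIVED = (
    "Received: from mx.example.com (mx.example.com [192.0.2.1])\n"
    "\tby mail.example.com for <user@example.com>;"
    " Mon, 9 May 2016 19:20:00 +0200 (CEST)\n"
)


def normalize_sample(raw_message):
    """Drop leading blank lines and replace all Received headers with one."""
    # A blank first line would be parsed as "end of headers".
    cleaned = raw_message.lstrip("\r\n")
    msg = email.message_from_string(cleaned)
    # Deleting by name removes every occurrence of the header.
    del msg["Received"]
    return GENERIC_RECEIVED + msg.as_string()


SAMPLE = (
    "\n"
    "Received: from a.example.net by b.example.net; Mon, 9 May 2016 10:00:00 +0000\n"
    "Received: from c.example.net by a.example.net; Mon, 9 May 2016 09:59:00 +0000\n"
    "From: sender@example.org\n"
    "Subject: test\n"
    "\n"
    "body text\n"
)

print(normalize_sample(SAMPLE))
```

The result can then be handed to `spamassassin` or `sa-learn` like any other sample file.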


On Tue, May 31, 2016 at 1:48 AM, Reindl Harald > wrote:



Am 31.05.2016 um 08:18 schrieb Shivram Krishnan:

It is not on production. I am using this to evaluate spamassassin.


how will you evaluate something when you slay your setup that way?






Re: Spamassassin not capturing obvious Spam

2016-05-31 Thread Antony Stone
On Tuesday 31 May 2016 at 15:21:19, Shivram Krishnan wrote:

> Here is my scenario. I am using SA as a oracle/ground truth for a research
> project.

Okay.

> It is generally hard to get hold of a real time mail corpus

Er, what??

> I opted for a service provided by mailinator.

> I have also trained SA using sa-learn on known public corpuses like enron
> etc.

I'm assuming from "trained" that this means you're using Bayes.  Two comments:

1. Where are you getting the "ham" from to train SA with, because it needs 
this as well as the "spam"?

2. You should be aware (*especially* if using this stuff as the basis of a 
research project - any competent referee should pick up on something like 
this) that SA works best when the emails it is asked to process are from the 
same source as it has been trained with.  In other words, you shovel real 
emails through a real mail server and train SA using this spam and ham; you 
then use that trained SA to assess mail passing through that same mail server, 
for the same users.  Anything significantly varying from this is not going to 
work well, and is certainly not a good test of how well SA works.

> What do you guys suggest me to do in this case? Is there a better way to do
> it?

Yes, run a real mail server and process real emails.

Can you tell us anything more about what the research project is, for which 
you are using SA as an "oracle / ground truth"?


Antony.

-- 
It is also possible that putting the birds in a laboratory setting 
inadvertently renders them relatively incompetent.

 - Daniel C Dennett

   Please reply to the list;
 please *don't* CC me.


Re: Spamassassin not capturing obvious Spam

2016-05-31 Thread Shivram Krishnan
Here is my scenario. I am using SA as an oracle/ground truth for a research
project. It is generally hard to get hold of a real-time mail corpus, so I
opted for a service provided by Mailinator. Mailinator is a company which
provides users with disposable email IDs, and it offers an API to obtain
the mails of the disposable IDs. Unfortunately it provides the mail in
JSON, and SA takes the mail in RFC 2822 format.

I have written a script which converts JSON to RFC 2822 (though RFC 2822
has a lot of requirements, I managed to capture most of them, just so
that SA has something to work with).
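
For illustration only, a conversion along these lines might look like the sketch below. It is not the script from this message. The JSON field names (`from`, `to`, `subject`, `body`, `date`) are assumptions and may not match what the Mailinator API actually returns, and no Received headers are generated.

```python
import json
from email.message import EmailMessage
from email.utils import formatdate

# Hypothetical JSON layout; the real Mailinator API fields may differ.
SAMPLE_JSON = json.dumps({
    "from": "sender@example.org",
    "to": "inbox@example.com",
    "subject": "Hello",
    "body": "Just a test message.\n",
})


def json_to_rfc2822(json_text):
    """Build an RFC 2822 style message string from a JSON mail record."""
    record = json.loads(json_text)
    msg = EmailMessage()
    msg["From"] = record["from"]
    msg["To"] = record["to"]
    msg["Subject"] = record["subject"]
    # Fall back to the current time if the record carries no date.
    msg["Date"] = record.get("date") or formatdate()
    msg.set_content(record["body"])
    # as_string() starts directly with the first header, so there is no
    # leading blank line that a parser would read as "end of headers".
    return msg.as_string()


print(json_to_rfc2822(SAMPLE_JSON))
```

Letting the standard library serialise the message avoids hand-built header blocks, which is where a stray blank line before the headers tends to creep in.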

I have also trained SA using sa-learn on known public corpuses like enron
etc.

I use SA, to classify the converted mails from Mailinator as HAM or SPAM.

for example, if a mail is stored in the text file mail.txt I run

spamassassin mail.txt

This returns the necessary score for me to decide if it is SPAM or not.

What do you guys suggest me to do in this case? Is there a better way to do
it?




On Tue, May 31, 2016 at 1:48 AM, Reindl Harald 
wrote:

>
>
> Am 31.05.2016 um 08:18 schrieb Shivram Krishnan:
>
>> It is not on production. I am using this to evaluate spamassassin.
>>
>
> how will you evaluate something when you slay your setup that way?
>
> On Mon, May 30, 2016 at 10:38 PM, @lbutlr wrote:
>>
>> On May 30, 2016, at 11:06 PM, Shivram Krishnan wrote:
>> > 2) I have set a threshold of -10 to see how spamassassin assigns a
>> > score for every mail.
>>
>> No. Do not do this
>>
>
>


Re: Spamassassin not capturing obvious Spam

2016-05-31 Thread Reindl Harald



Am 31.05.2016 um 08:18 schrieb Shivram Krishnan:

It is not on production. I am using this to evaluate spamassassin.


how will you evaluate something when you slay your setup that way?


On Mon, May 30, 2016 at 10:38 PM, @lbutlr wrote:

On May 30, 2016, at 11:06 PM, Shivram Krishnan wrote:
> 2) I have set a threshold of -10 to see how spamassassin assigns a score
> for every mail.

No. Do not do this






Re: Spamassassin not capturing obvious Spam

2016-05-31 Thread Reindl Harald



Am 31.05.2016 um 04:24 schrieb Shivram Krishnan:

I am testing spamassassin on a SPAM/HAM corpus of mails. Spamassassin is
not picking up an obvious spam like in this case
http://pastebin.com/MbNRNFWy .


your sample is mangled and hence crap
it's even damaged because of a leading newline
frankly, you even removed the SA headers



Re: Spamassassin not capturing obvious Spam

2016-05-31 Thread Dave Funk

OK,

So you are testing to see how SA scores artificial mail messages.
However SA is designed to evaluate real mail messages, not botched
fabrications of them, so I don't understand what you are trying to achieve.

You have (either deliberately or unknowingly) omitted the necessary
information that SA needs to perform meaningful network based tests.

If you want to test SA with network based tests explicitly disabled there
are command line (or configuration) options to achieve that. When you use
those options it causes SA to "shift gears" and changes how various
remaining parts are utilized.

So in a way you are crippling SA by withholding info it needs for network
based tests but not telling it that you are doing that so it doesn't
"know" to bring full force of the non-network components to bear.
I'm not surprised that its performance is sub-par in this situation.

What are you trying to achieve with this artificial scenario?

On Mon, 30 May 2016, Shivram Krishnan wrote:


1) The message is indeed fabricated. I had to generate a RFC 2822 mail from
JSON. I am harvesting SPAM mails from mailinator.com (public email's). So
that is an error in my generation of the RFC 2822. I did not change it as
spamassassin did not assign a score.

2) I have set a threshold of -10 to see how spamassassin assigns a score
for every mail.




On Mon, May 30, 2016 at 8:25 PM, Dave Funk  wrote:
  That message is either a fabrication or something from a messed up system.
  There's no sign of an IP address (neither IPv4 nor IPv6) in it.

  There are two identical 'Received:' headers which have '()' where
  there should be at least the IP address of the incoming connection.

  This indicates that the message has either been tampered with or is from a
  postfix system that somebody has messed up the configuration.


--
Dave Funk  University of Iowa
College of Engineering
319/335-5751   FAX: 319/384-0549   1256 Seamans Center
Sys_admin/Postmaster/cell_admin   Iowa City, IA 52242-1527
#include 
Better is not better, 'standard' is better. B{


Re: Spamassassin not capturing obvious Spam

2016-05-31 Thread LuKreme
On May 31, 2016, at 00:18, Shivram Krishnan  wrote:
> It is not on production. I am using this to evaluate spamassassin.

You are not testing or evaluating properly when you break the configuration.

-- 



Re: Spamassassin not capturing obvious Spam

2016-05-31 Thread Shivram Krishnan
It is not on production. I am using this to evaluate spamassassin.

On Mon, May 30, 2016 at 10:38 PM, @lbutlr  wrote:

> On May 30, 2016, at 11:06 PM, Shivram Krishnan wrote:
> > 2) I have set a threshold of -10 to see how spamassassin assigns a score
> > for every mail.
>
> No. Do not do this.
>
> --
> When the routine bites hard / and ambitions are low And the resentment
> rides high / but emotions won't grow And we're changing our ways, /
> taking different roads Then love, love will tear us apart again
>
>


Re: Spamassassin not capturing obvious Spam

2016-05-30 Thread @lbutlr
On May 30, 2016, at 11:06 PM, Shivram Krishnan  wrote:
> 2) I have set a threshold of -10 to see how spamassassin assigns a score for 
> every mail. 

No. Do not do this.

-- 
When the routine bites hard / and ambitions are low And the resentment
rides high / but emotions won't grow And we're changing our ways, /
taking different roads Then love, love will tear us apart again



Re: Spamassassin not capturing obvious Spam

2016-05-30 Thread Shivram Krishnan
1) The message is indeed fabricated. I had to generate a RFC 2822 mail from
JSON. I am harvesting SPAM mails from mailinator.com (public email's). So
that is an error in my generation of the RFC 2822. I did not change it as
spamassassin did not assign a score.

2) I have set a threshold of -10 to see how spamassassin assigns a score
for every mail.



On Mon, May 30, 2016 at 8:25 PM, Dave Funk 
wrote:

> That message is either a fabrication or something from a messed up system.
> There's no sign of an IP address (neither IPv4 nor IPv6) in it.
>
> There are two identical 'Received:' headers which have '()' where
> there should be at least the IP address of the incoming connection.
>
> This indicates that the message has either been tampered with or is from a
> postfix system that somebody has messed up the configuration.
>
>
>
> On Mon, 30 May 2016, Shivram Krishnan wrote:
>
> Hey guys,
>>
>> I am testing spamassassin on a SPAM/HAM corpus of mails. Spamassassin is
>> not picking up an obvious
>> spam like in this case http://pastebin.com/MbNRNFWy .
>>
>> I have followed the guidelines on
>> https://wiki.apache.org/spamassassin/ImproveAccuracy .
>>
>> Let me know how to catch these type of Spams. It would be interesting to
>> know what your spamassassin
>> assigns the score for this spam.
>>
>> spamassassin assigned this score -
>>
>> Content analysis details:   (3.9 points, -10.0 required)
>>
>>  pts rule name              description
>> ---- ---------------------- --------------------------------------------------
>>  0.8 BAYES_50               BODY: Bayes spam probability is 40 to 60%
>>                             [score: 0.4292]
>>  0.0 HTML_MESSAGE           BODY: HTML included in message
>>  0.7 MIME_HTML_ONLY         BODY: Message only has text/html MIME parts
>>  0.4 HTML_MIME_NO_HTML_TAG  HTML-only message, but there is no HTML tag
>>  0.0 UNPARSEABLE_RELAY      Informational: message has unparseable relay
>>                             lines
>>  2.0 XPRIO                  Has X-Priority header
>>
>>
>>
>> Notice that none of the  other body tags are triggered.
>>
>> Thanks,
>>
>> Shivram
>>
>>
>>
> --
> Dave Funk  University of Iowa
> College of Engineering
> 319/335-5751   FAX: 319/384-0549   1256 Seamans Center
> Sys_admin/Postmaster/cell_admin   Iowa City, IA 52242-1527
> #include 
> Better is not better, 'standard' is better. B{


Re: Spamassassin not capturing obvious Spam

2016-05-30 Thread Dave Funk

That message is either a fabrication or something from a messed up system.
There's no sign of an IP address (neither IPv4 nor IPv6) in it.

There are two identical 'Received:' headers which have '()' where
there should be at least the IP address of the incoming connection.

This indicates that the message has either been tampered with or is from a 
postfix system that somebody has messed up the configuration.
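
A rough way to spot this kind of problem before feeding a corpus to SA is to check whether any `Received:` header carries a parseable IP address. The sketch below is only a simplified heuristic, written with Python's standard library; it is not how SpamAssassin itself parses relay lines, and the function name `received_ips` is made up for illustration.

```python
import ipaddress
import re
from email import message_from_string

# Tokens in brackets or parentheses that might be IP literals,
# e.g. "[192.0.2.10]" or "[IPv6:2001:db8::1]".
_CANDIDATE = re.compile(r"[\[(]\s*(?:IPv6:)?([0-9A-Fa-f:.]+)\s*[\])]")


def received_ips(raw_message: str) -> list[str]:
    """Return every valid IP address found in the Received: headers."""
    msg = message_from_string(raw_message)
    found = []
    for value in msg.get_all("Received", []):
        for token in _CANDIDATE.findall(value):
            try:
                found.append(str(ipaddress.ip_address(token)))
            except ValueError:
                # Not an IP address after all (e.g. a hex-looking word).
                continue
    return found


if __name__ == "__main__":
    broken = (
        "Received: from mail.example.com ()\n"
        "\tby mx.example.net; Mon, 30 May 2016 20:24:00 -0600\n"
        "Subject: test\n"
        "\n"
        "body\n"
    )
    if not received_ips(broken):
        print("no relay IP found; network-based tests have nothing to check")
```

An empty result for a message corresponds to the '()' case described above: there is no incoming-connection address for network-based checks to work with.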



On Mon, 30 May 2016, Shivram Krishnan wrote:


Hey guys,

I am testing spamassassin on a SPAM/HAM corpus of mails. Spamassassin is not 
picking up an obvious
spam like in this case http://pastebin.com/MbNRNFWy .

I have followed the guidelines on 
https://wiki.apache.org/spamassassin/ImproveAccuracy .

Let me know how to catch these type of Spams. It would be interesting to know 
what your spamassassin
assigns the score for this spam.

spamassassin assigned this score -

Content analysis details:   (3.9 points, -10.0 required)

 pts rule name              description
---- ---------------------- --------------------------------------------------
 0.8 BAYES_50               BODY: Bayes spam probability is 40 to 60%
                            [score: 0.4292]
 0.0 HTML_MESSAGE           BODY: HTML included in message
 0.7 MIME_HTML_ONLY         BODY: Message only has text/html MIME parts
 0.4 HTML_MIME_NO_HTML_TAG  HTML-only message, but there is no HTML tag
 0.0 UNPARSEABLE_RELAY      Informational: message has unparseable relay lines
 2.0 XPRIO                  Has X-Priority header



Notice that none of the  other body tags are triggered.

Thanks,

Shivram




--
Dave Funk  University of Iowa
College of Engineering
319/335-5751   FAX: 319/384-0549   1256 Seamans Center
Sys_admin/Postmaster/cell_admin   Iowa City, IA 52242-1527
#include 
Better is not better, 'standard' is better. B{

Re: Spamassassin not capturing obvious Spam

2016-05-30 Thread LuKreme
On May 30, 2016, at 20:24, Shivram Krishnan  wrote:
> I have followed the guidelines on 
> https://wiki.apache.org/spamassassin/ImproveAccuracy .

No, you really haven't.

> Content analysis details:   (3.9 points, -10.0 required)

This makes no sense at all. Either you have set the spam scores negative, which 
makes no sense, or you have set it to 10, which makes no sense.

Train more spam and don't muck with the levels.



Re: Spamassassin not capturing obvious Spam

2016-05-30 Thread Rob McEwen

On 5/30/2016 10:24 PM, Shivram Krishnan wrote:

I am testing spamassassin on a SPAM/HAM corpus of mails. Spamassassin is
not picking up an obvious spam like in this case
http://pastebin.com/MbNRNFWy .


Your pastebin example didn't show the "last external" sending IP. Could 
it have been there originally, but was expunged from this sample? Could 
there have also been a link in the body of the message that was likewise 
removed?


It would be nice to be able to check those against respected low-FP DNSBLs.

Or, if the clickable link really wasn't in the original message, then 
this particular example was probably a rare malfunctioned spam that will 
be of no benefit to the spammer, and would then probably not be worth 
investigating since the spammer then has no incentive to keep sending 
these types.


--
Rob McEwen




Spamassassin not capturing obvious Spam

2016-05-30 Thread Shivram Krishnan
Hey guys,

I am testing spamassassin on a SPAM/HAM corpus of mails. Spamassassin is
not picking up an obvious spam like in this case
http://pastebin.com/MbNRNFWy .

I have followed the guidelines on
https://wiki.apache.org/spamassassin/ImproveAccuracy .

Let me know how to catch these type of Spams. It would be interesting to
know what your spamassassin assigns the score for this spam.

spamassassin assigned this score -

Content analysis details:   (3.9 points, -10.0 required)

 pts rule name              description
---- ---------------------- --------------------------------------------------
 0.8 BAYES_50               BODY: Bayes spam probability is 40 to 60%
                            [score: 0.4292]
 0.0 HTML_MESSAGE           BODY: HTML included in message
 0.7 MIME_HTML_ONLY         BODY: Message only has text/html MIME parts
 0.4 HTML_MIME_NO_HTML_TAG  HTML-only message, but there is no HTML tag
 0.0 UNPARSEABLE_RELAY      Informational: message has unparseable relay
                            lines
 2.0 XPRIO                  Has X-Priority header



Notice that none of the  other body tags are triggered.

Thanks,

Shivram
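
The score in a report like the one above is the sum of the per-rule points, and the "required" value only decides which side of the line that sum falls on. A small sketch that parses the report text and checks the arithmetic is given below. It assumes the report layout shown in this message; the function name `parse_report` and the regular expressions are illustrative, not part of SpamAssassin.

```python
import re

REPORT = """\
Content analysis details:   (3.9 points, -10.0 required)

 pts rule name              description
---- ---------------------- --------------------------------------------------
 0.8 BAYES_50               BODY: Bayes spam probability is 40 to 60%
                            [score: 0.4292]
 0.0 HTML_MESSAGE           BODY: HTML included in message
 0.7 MIME_HTML_ONLY         BODY: Message only has text/html MIME parts
 0.4 HTML_MIME_NO_HTML_TAG  HTML-only message, but there is no HTML tag
 0.0 UNPARSEABLE_RELAY      Informational: message has unparseable relay
                            lines
 2.0 XPRIO                  Has X-Priority header
"""

SUMMARY = re.compile(
    r"\(\s*(-?\d+(?:\.\d+)?) points, (-?\d+(?:\.\d+)?) required\)"
)
# A rule row: points, then an upper-case rule name, then the description.
RULE = re.compile(
    r"^[ \t]*(-?\d+\.\d+)[ \t]+([A-Z][A-Z0-9_]*)[ \t]", re.MULTILINE
)


def parse_report(text):
    """Return (total, required, {rule_name: points}) from a report."""
    summary = SUMMARY.search(text)
    if summary is None:
        raise ValueError("no 'points, required' summary line found")
    total = float(summary.group(1))
    required = float(summary.group(2))
    rules = {name: float(pts) for pts, name in RULE.findall(text)}
    return total, required, rules


if __name__ == "__main__":
    total, required, rules = parse_report(REPORT)
    rule_sum = sum(rules.values())
    # 0.8 + 0.0 + 0.7 + 0.4 + 0.0 + 2.0 = 3.9
    print(f"sum of rules = {rule_sum:.1f}, reported total = {total}")
    # With required = -10.0 this comparison is true for almost any message.
    print("flagged as spam:", total >= required)
```

Because the per-rule points are listed in the report, the total can be read off without lowering the required score.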