Re: Stats (was: The EmailBL test zone period has been extended to July 1st.)

2009-05-26 Thread Justin Mason
btw guys, note that hit-frequencies can also produce rule-overlap reports using
the "-o" switch

--j.

On Tue, May 26, 2009 at 00:57, Mandy  wrote:
> On Fri, May 22, 2009 at 9:06 PM, Henrik K  wrote:
>> On Fri, May 22, 2009 at 09:28:55PM +0200, Karsten Bräckelmann wrote:
>>> > The EmailBL test zone period has been extended to July 1st.
>
> [snip]
>
>> Thanks. And this is just a small scale test. If we used more domains, feeds,
>> and submissions, it could be even nicer. ;-) Keep the reports coming in. It
>> would be nice to also know how much spam generally comes from freemails, so
>> FREEMAIL_FROM/BODY/REPLYTO figures would be nice also when reporting. It
>> might differ from user to user.
>
> I just spent some time putting together some stats.  I'm going to try
> to follow the excellent lead of Karsten, and provide some overlap
> figures based on the cool grep formula that Dan Mcdonald showed.  The
> short version is that it hits about 12% of spam scoring under 15.
>
> The time period is somewhat short: May 22 to May 25.  It's a little
> inaccurate too, due to 12 hours of extra mail in the May 22 side
> because I implemented at noon, but...
>
> As I mentioned before, this is from a mid-sized install of Canadian
> government & education users (somewhere around 100 000 mailboxes).  SA
> only sees a filtered mail-stream in my setup -- to give an idea how
> filtered, 75% of the mail that SA sees is classified as ham.  The
> total volumes were 192 530 Spam, 564 483 Ham.
>
>
> 24.5% of the spam that's tagged is between 5 & 10 score.
> 2.76% of that mail hit EMAILBL_TEST_LEM.
> 0.95% hit FREEMAIL_REPLYTO
>
> 22.9% of the spam that's tagged is between 10 and 15.
> 8.97% of that mail hit EMAILBL_TEST_LEM.
> 1.20% hit FREEMAIL_REPLYTO
>
> 52.5% of the spam that's tagged is above 15.
> 21.41% of that mail hit EMAILBL_TEST_LEM.
> 2.36% hit FREEMAIL_REPLYTO
>
> I also saw 0.05% hits of EMAILBL_TEST_LEM on mail classified as ham.
> I hand-verified the 35 messages of 299 that weren't obvious spam.
> About 9 of those were FPs (and those came down to 3 distinct messages
> from lists I sure wouldn't choose to be on).  I can provide them
> off-list if desired.
>
> I saw even fewer FREEMAIL_REPLYTO hits on mail classified as ham.  56,
> or 0.01%.  About 22 of those (based on subject line -- sorry it's the
> end of the day) look legit.
>
> Here are the overlap numbers for spam with a logged score of 10 or less:
> $ grep EMAILBL_TEST_LEM spamd_since_22nd \
>     | perl -ne 'if (/spamd: result: Y (\d+)/) { print if $1 <= 10 }' \
>     | cut -d' ' -f11 | egrep -o '[A-Z0-9_:\.]+?,' \
>     | sort | uniq -c | sort -rn | head -n15
>   1304 EMAILBL_TEST_LEM,
>    728 RAZOR2_CHECK,
>    643 RAZOR2_CF_RANGE_51_100,
>    629 RAZOR2_CF_RANGE_E4_51_100,
>    612 BAYES_50,
>    590 FORGED_YAHOO_RCVD,
>    582 BAYES_99,
>    282 HTML_MESSAGE,
>    199 FREEMAIL_FROM,
>    157 ADVANCE_FEE_2,
>    132 FORGED_MUA_OUTLOOK,
>    114 FREEMAIL_REPLYTO,
>    103 RCVD_IN_BRBL,
>     72 SPF_PASS,
>
> And here they are for all hits on EMAILBL_TEST_LEM:
> $ grep EMAILBL_TEST_LEM spamd_since_22nd | cut -d' ' -f11 \
>     | egrep -o '[A-Z0-9_:\.]+?,' | sort | uniq -c | sort -rn | head -n15
>  41503 EMAILBL_TEST_LEM,
>  38987 BAYES_99,
>  36782 FORGED_MUA_OUTLOOK,
>  36028 ADVANCE_FEE_2,
>  33746 RCVD_IN_BRBL,
>  33506 JM_SOUGHT_FRAUD_3,
>  33214 JM_SOUGHT_FRAUD_2,
>  33186 HTML_MESSAGE,
>  32281 RCVD_IN_BL_SPAMCOP_NET,
>  31953 JM_SOUGHT_FRAUD_1,
>  31914 RDNS_NONE,
>  31893 RCVD_IN_SBL,
>  31883 MIME_HTML_ONLY,
>
> Phew.  Hopefully those numbers are useful.
>
>


Re: Stats (was: The EmailBL test zone period has been extended to July 1st.)

2009-05-25 Thread Mandy
On Fri, May 22, 2009 at 9:06 PM, Henrik K  wrote:
> On Fri, May 22, 2009 at 09:28:55PM +0200, Karsten Bräckelmann wrote:
>> > The EmailBL test zone period has been extended to July 1st.

[snip]

> Thanks. And this is just a small scale test. If we used more domains, feeds,
> and submissions, it could be even nicer. ;-) Keep the reports coming in. It
> would be nice to also know how much spam generally comes from freemails, so
> FREEMAIL_FROM/BODY/REPLYTO figures would be nice also when reporting. It
> might differ from user to user.

I just spent some time putting together some stats.  I'm going to try
to follow the excellent lead of Karsten, and provide some overlap
figures based on the cool grep formula that Dan Mcdonald showed.  The
short version is that it hits about 12% of spam scoring under 15.

The time period is somewhat short: May 22 to May 25.  It's a little
inaccurate too, due to 12 hours of extra mail in the May 22 side
because I implemented at noon, but...

As I mentioned before, this is from a mid-sized install of Canadian
government & education users (somewhere around 100 000 mailboxes).  SA
only sees a filtered mail-stream in my setup -- to give an idea how
filtered, 75% of the mail that SA sees is classified as ham.  The
total volumes were 192 530 Spam, 564 483 Ham.


24.5% of the spam that's tagged is between 5 & 10 score.
2.76% of that mail hit EMAILBL_TEST_LEM.
0.95% hit FREEMAIL_REPLYTO

22.9% of the spam that's tagged is between 10 and 15.
8.97% of that mail hit EMAILBL_TEST_LEM.
1.20% hit FREEMAIL_REPLYTO

52.5% of the spam that's tagged is above 15.
21.41% of that mail hit EMAILBL_TEST_LEM.
2.36% hit FREEMAIL_REPLYTO
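
The per-band figures above can be pooled into a single rate for spam scoring under 15 by weighting each band's hit rate by that band's share of tagged spam. A minimal sketch, assuming the listed percentages are exact; the `bands` table and the `pooled_rate` helper are illustrative names, not part of any SpamAssassin tooling.

```python
# Band figures copied from the message above:
# name -> (share of tagged spam in %, EMAILBL_TEST_LEM hit rate in %)
bands = {
    "5-10": (24.5, 2.76),
    "10-15": (22.9, 8.97),
    "15+": (52.5, 21.41),
}


def pooled_rate(names):
    """Hit rate (%) over the union of the named bands, weighted by band size."""
    total_share = sum(bands[n][0] for n in names)
    weighted_hits = sum(bands[n][0] * bands[n][1] for n in names)
    return weighted_hits / total_share


under_15 = pooled_rate(["5-10", "10-15"])
print(f"pooled hit rate, score < 15: {under_15:.2f}%")
```

With these inputs the pooled figure comes out a little under 6%, while simply adding the two band rates gives 11.73%, so it is worth stating which of the two a summary number refers to.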

I also saw 0.05% hits of EMAILBL_TEST_LEM on mail classified as ham.
I hand-verified the 35 messages of 299 that weren't obvious spam.
About 9 of those were FPs (and those came down to 3 distinct messages
from lists I sure wouldn't choose to be on).  I can provide them
off-list if desired.

I saw even fewer FREEMAIL_REPLYTO hits on mail classified as ham.  56,
or 0.01%.  About 22 of those (based on subject line -- sorry it's the
end of the day) look legit.

Here are the overlap numbers for spam with a logged score of 10 or less:
$ grep EMAILBL_TEST_LEM spamd_since_22nd \
    | perl -ne 'if (/spamd: result: Y (\d+)/) { print if $1 <= 10 }' \
    | cut -d' ' -f11 | egrep -o '[A-Z0-9_:\.]+?,' \
    | sort | uniq -c | sort -rn | head -n15
   1304 EMAILBL_TEST_LEM,
    728 RAZOR2_CHECK,
    643 RAZOR2_CF_RANGE_51_100,
    629 RAZOR2_CF_RANGE_E4_51_100,
    612 BAYES_50,
    590 FORGED_YAHOO_RCVD,
    582 BAYES_99,
    282 HTML_MESSAGE,
    199 FREEMAIL_FROM,
    157 ADVANCE_FEE_2,
    132 FORGED_MUA_OUTLOOK,
    114 FREEMAIL_REPLYTO,
    103 RCVD_IN_BRBL,
     72 SPF_PASS,

And here they are for all hits on EMAILBL_TEST_LEM:
$ grep EMAILBL_TEST_LEM spamd_since_22nd | cut -d' ' -f11 \
    | egrep -o '[A-Z0-9_:\.]+?,' | sort | uniq -c | sort -rn | head -n15
  41503 EMAILBL_TEST_LEM,
  38987 BAYES_99,
  36782 FORGED_MUA_OUTLOOK,
  36028 ADVANCE_FEE_2,
  33746 RCVD_IN_BRBL,
  33506 JM_SOUGHT_FRAUD_3,
  33214 JM_SOUGHT_FRAUD_2,
  33186 HTML_MESSAGE,
  32281 RCVD_IN_BL_SPAMCOP_NET,
  31953 JM_SOUGHT_FRAUD_1,
  31914 RDNS_NONE,
  31893 RCVD_IN_SBL,
  31883 MIME_HTML_ONLY,

Phew.  Hopefully those numbers are useful.
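
For anyone who would rather not depend on the rule list being the 11th space-separated field, the same overlap count can be sketched in Python. This is only an illustration: it assumes spamd log lines of the form `spamd: result: Y <score> - RULE1,RULE2,...`, and the `overlap_counts` helper, the `RESULT_RE` pattern and the sample lines are made up for the example.

```python
import re
from collections import Counter

# Assumed log layout: "... spamd: result: Y <score> - RULE1,RULE2,... scantime=..."
# "Y" marks mail tagged as spam, "." marks ham.
RESULT_RE = re.compile(r"spamd: result: (Y|\.) (-?\d+) - (\S+)")


def overlap_counts(lines, rule, max_score=None, spam_only=False):
    """Count every rule that fired on messages which also hit `rule`."""
    counts = Counter()
    for line in lines:
        m = RESULT_RE.search(line)
        if not m:
            continue
        tag, score, rules = m.group(1), int(m.group(2)), m.group(3).split(",")
        if spam_only and tag != "Y":
            continue
        if max_score is not None and score > max_score:
            continue
        if rule in rules:
            counts.update(rules)
    return counts


# Made-up sample lines, only to show the call.
sample = [
    "May 22 12:00:01 mx spamd[101]: spamd: result: Y 7 - BAYES_50,EMAILBL_TEST_LEM,RAZOR2_CHECK scantime=1.0",
    "May 22 12:00:02 mx spamd[101]: spamd: result: Y 22 - BAYES_99,EMAILBL_TEST_LEM scantime=1.0",
    "May 22 12:00:03 mx spamd[101]: spamd: result: . 1 - EMAILBL_TEST_LEM,SPF_PASS scantime=1.0",
    "May 22 12:00:04 mx spamd[101]: spamd: result: Y 9 - BAYES_99,HTML_MESSAGE scantime=1.0",
]

low = overlap_counts(sample, "EMAILBL_TEST_LEM", max_score=10, spam_only=True)
for name, n in low.most_common(15):
    print(f"{n:7d} {name}")
```

Unlike a plain grep, this matches the rule name exactly, so a longer rule name that merely contains the string would not be counted.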


Re: Stats (was: The EmailBL test zone period has been extended to July 1st.)

2009-05-23 Thread Karsten Bräckelmann
Sorry, quoting self.

> > > An interesting observation is, that the hitrate (in percent) in spam
> > > scoring < 15 is an order of magnitude higher than with high-scoring [1]
> > > spam. This is rare to find...

> That's limited to EmailBL hits, so the total of these hits equal 100%.
> For me that would have been:
> 
>   19.4%  of mail hitting EmailBL has a score < 15
>   80.6%  of mail hitting EmailBL has a score > 15

Oh, and EmailBL hits only a mere 1.02% of *all* my spam anyway. That's
poor, isn't it?

No, it is not! :)  Because it identifies a whopping 10.9% of the spam,
that isn't already branded by all those existing tests.
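
The two ways of slicing the numbers are related by Bayes' rule: the hit rate inside a score band equals the band's share of all hits, times the overall hit rate, divided by the band's share of all spam. A rough sketch using the rounded figures quoted in this thread; the 1.5% / 98.5% split is an assumption (the thread only says "more than 98.5%"), and `rate_within_band` is an illustrative helper name.

```python
def rate_within_band(share_of_hits, overall_hit_rate, band_share_of_spam):
    """P(hit | band) = P(band | hit) * P(hit) / P(band); all arguments are fractions."""
    return share_of_hits * overall_hit_rate / band_share_of_spam


overall = 0.0102  # EmailBL hits 1.02% of all spam

# Assumption: exactly 1.5% of spam scores below 15 and 98.5% above.
low = rate_within_band(0.194, overall, 0.015)
high = rate_within_band(0.806, overall, 0.985)

print(f"score < 15: {low:.1%}   score > 15: {high:.2%}   ratio: {low / high:.0f}x")
```

Because the inputs are rounded and may not all cover the same sample, the output is only expected to land in the same range as the reported figures, not to reproduce them exactly. It does show the same pattern: a hit rate below 1% in high scorers and more than ten times that in spam scoring under 15.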


> And that's what counts in my book. I don't care if the lion's share of
> EmailBL hits are actually high scorers. Those don't need a boost anyway.
> What I do care about are hits in the sneaky-ish crap. And that's where
> it hits on more than 10%.

I don't write rules to target high-scoring spam either. I look at sneaky
spam, and write rules to identify those. If it also hits on a lot of
high scorers, fine. But I couldn't care less, if it does -- as long as
it identifies those going under the radar of a score 15 cutoff.


-- 
char *t="\10pse\0r\0dtu...@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}



Re: Stats (was: The EmailBL test zone period has been extended to July 1st.)

2009-05-23 Thread Karsten Bräckelmann
On Sat, 2009-05-23 at 11:26 -0500, Larry Nedry wrote:
> On 5/22/09 at 9:28 PM +0200 Karsten Bräckelmann wrote:
> >An interesting observation is, that the hitrate (in percent) in spam
> >scoring < 15 is an order of magnitude higher than with high-scoring [1]
> >spam. This is rare to find...
> 
> My EMAILBL_TEST_LEM hitrate leans heavily toward the other end of the
> spectrum with almost 88% scoring > 15.  My data is based on a little more
> than 100,000 emails.

Wait, you're looking at the hits differently than I did.

> Stats for only messages tagged with EMAILBL_TEST_LEM:
> 
> 04.5% scored 00.0 - 05.0
> 03.0% scored 05.0 - 10.0
> 04.5% scored 10.0 - 15.0
> 09.1% scored 15.0 - 20.0
> 78.8% scored 20.0 or higher

That's limited to EmailBL hits, so the total of these hits equal 100%.
For me that would have been:

  19.4%  of mail hitting EmailBL has a score < 15
  80.6%  of mail hitting EmailBL has a score > 15

However, a score > 15 is more than 98.5% of my spam. Taking that into
account, the numbers change drastically. That's what I reported. Less
than 1% hits in ALL spam with a total score of 15 or higher.

Yet, 10.9% hits in ALL spam with a score less than 15.

And that's what counts in my book. I don't care if the lion's share of
EmailBL hits are actually high scorers. Those don't need a boost anyway.
What I do care about are hits in the sneaky-ish crap. And that's where
it hits on more than 10%.


Larry, what numbers do you get, if you count hits in ALL your spam
in-stream, broken down by scores?

  guenther


-- 
char *t="\10pse\0r\0dtu...@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}



Re: Stats (was: The EmailBL test zone period has been extended to July 1st.)

2009-05-23 Thread Larry Nedry
On 5/22/09 at 9:28 PM +0200 Karsten Bräckelmann wrote:
>An interesting observation is, that the hitrate (in percent) in spam
>scoring < 15 is an order of magnitude higher than with high-scoring [1]
>spam. This is rare to find...

My EMAILBL_TEST_LEM hitrate leans heavily toward the other end of the
spectrum with almost 88% scoring > 15.  My data is based on a little more
than 100,000 emails.

EMAILBL_TEST_LEM stats for all messages passed through Spamassassin:
- hit 2.00% of all email tagged as spam.
- hit 0.04% of all email tagged as ham.

There were no false positives and the messages tagged as ham were false
negatives.  If I had given EMAILBL_TEST_LEM a score of 2.0 instead of its
current 0.001 all but one of the FNs would have been properly tagged as
spam.

Stats for only messages tagged with EMAILBL_TEST_LEM:

04.5% scored 00.0 - 05.0
03.0% scored 05.0 - 10.0
04.5% scored 10.0 - 15.0
09.1% scored 15.0 - 20.0
78.8% scored 20.0 or higher

Nedry


Re: Stats (was: The EmailBL test zone period has been extended to July 1st.)

2009-05-23 Thread Chris
On Sat, 2009-05-23 at 07:06 +0300, Henrik K wrote:

> Thanks. And this is just a small scale test. If we used more domains, feeds,
> and submissions, it could be even nicer. ;-) Keep the reports coming in. It
> would be nice to also know how much spam generally comes from freemails, so
> FREEMAIL_FROM/BODY/REPLYTO figures would be nice also when reporting. It
> might differ from user to user.
> 
Freemail stats from 3 May through yesterday:

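Rule Name           Score   Ham   Spam   %of Ham   %of Spam
-----------------------------------------------------------
  FREEMAIL_REPLYTO   2.00     1     46     0.28%     21.20%
  FREEMAIL_FROM      0.50     7     87     1.97%     40.09%
-----------------------------------------------------------
  OVERALL                     7     90     1.97%     41.47%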

-- 
KeyID 0xE372A7DA98E6705C





Re: Stats (was: The EmailBL test zone period has been extended to July 1st.)

2009-05-22 Thread Henrik K
On Fri, May 22, 2009 at 09:28:55PM +0200, Karsten Bräckelmann wrote:
> > The EmailBL test zone period has been extended to July 1st.
> 
> As promised, here are some results from me, now that I got some half-
> decent spam throughput. Not an ISP, not a company. Have been running the
> original cf for 5 days, then updated. Since then another 5 days passed.
> 
>    8.7%  hits in spam scoring 05-10
>   11.5%  hits in spam scoring 10-15
> 
>   Hit 1 out of 3 FNs (spam, score < 5).  No hits in ham.
> 
> Overall hitrate. Numbers are even better just looking at the last 5 days
> worth, after the update. Less than 1% hits in the spam scoring >= 15,
> but that's entirely perfect. :)
> 
> About half of the hits in spam < 15 also are caught by SOUGHT_FRAUD.
> That overlap was to be expected.
> 
> An interesting observation is, that the hitrate (in percent) in spam
> scoring < 15 is an order of magnitude higher than with high-scoring [1]
> spam. This is rare to find...
> 
> 
> Looks really good to me, guys! Great job.  Can I keep it? :)

Thanks. And this is just a small scale test. If we used more domains, feeds,
and submissions, it could be even nicer. ;-) Keep the reports coming in. It
would be nice to also know how much spam generally comes from freemails, so
FREEMAIL_FROM/BODY/REPLYTO figures would be nice also when reporting. It
might differ from user to user.



Stats (was: The EmailBL test zone period has been extended to July 1st.)

2009-05-22 Thread Karsten Bräckelmann
> The EmailBL test zone period has been extended to July 1st.

As promised, here are some results from me, now that I got some half-
decent spam throughput. Not an ISP, not a company. Have been running the
original cf for 5 days, then updated. Since then another 5 days passed.

   8.7%  hits in spam scoring 05-10
  11.5%  hits in spam scoring 10-15

  Hit 1 out of 3 FNs (spam, score < 5).  No hits in ham.

Overall hitrate. Numbers are even better just looking at the last 5 days
worth, after the update. Less than 1% hits in the spam scoring >= 15,
but that's entirely perfect. :)

About half of the hits in spam < 15 also are caught by SOUGHT_FRAUD.
That overlap was to be expected.

An interesting observation is, that the hitrate (in percent) in spam
scoring < 15 is an order of magnitude higher than with high-scoring [1]
spam. This is rare to find...


Looks really good to me, guys! Great job.  Can I keep it? :)

  guenther


[1] Which accounts for the bulk of my spam with > 98.5%. The goal is to
find ways to score the remaining 1.5%.

-- 
char *t="\10pse\0r\0dtu...@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}