Re: perceptron and over-scoring (Re: Over-scoring of SURBL lists... )

2006-02-21 Thread Jeff Chan
On Monday, February 20, 2006, 12:39:31 PM, Theo Dinter wrote:

 Just for some info...  I went through the set1 spam logs for 3.1 score
 generation.

 1112804 total messages
  776108 messages hit SURBL
  138407 1 SURBL list(s) hit (1+ = 776108)
  189795 2 SURBL list(s) hit (2+ = 637701)
  281255 3 SURBL list(s) hit (3+ = 447906)
  136964 4 SURBL list(s) hit (4+ = 166651)
   29685 5 SURBL list(s) hit (5+ = 29687)
       2 6 SURBL list(s) hit (6+ = 2)

 The set1 ham logs:

 477629  total messages
   1023  messages hit SURBL
    992  1 SURBL list(s) hit (1+ = 1023)
     23  2 SURBL list(s) hit (2+ = 31)
      5  3 SURBL list(s) hit (3+ = 8)
      3  4 SURBL list(s) hit (4+ = 3)
      0  5 SURBL list(s) hit (5+ = 0)
      0  6 SURBL list(s) hit (6+ = 0)


 So from these results, the FP rate is very low for SURBL (0.21%), and
 while there is a ton of overlap for spam (57.3%), there's very little
 for ham (0.01%).


Thank you for the data.  They seem to support what we've been saying.

At a count of 138407, messages that hit only 1 SURBL list are a
sizable group (about 18% of the spam that hit SURBL), so sharply
lowering the score of a single-list hit may result in a significant
number of FNs.

Cheers,

Jeff C.
-- 
Jeff Chan
mailto:[EMAIL PROTECTED]
http://www.surbl.org/



Re: perceptron and over-scoring (Re: Over-scoring of SURBL lists... )

2006-02-21 Thread Maurice Lucas
On Tue, 2006-02-21 at 06:53 -0800, Jeff Chan wrote:
 On Monday, February 20, 2006, 12:39:31 PM, Theo Dinter wrote:
 
  Just for some info...  I went through the set1 spam logs for 3.1 score
  generation.
 
  1112804 total messages
   776108 messages hit SURBL
   138407 1 SURBL list(s) hit (1+ = 776108)
   189795 2 SURBL list(s) hit (2+ = 637701)
   281255 3 SURBL list(s) hit (3+ = 447906)
   136964 4 SURBL list(s) hit (4+ = 166651)
    29685 5 SURBL list(s) hit (5+ = 29687)
        2 6 SURBL list(s) hit (6+ = 2)
 
  The set1 ham logs:
 
  477629  total messages
    1023  messages hit SURBL
     992  1 SURBL list(s) hit (1+ = 1023)
      23  2 SURBL list(s) hit (2+ = 31)
       5  3 SURBL list(s) hit (3+ = 8)
       3  4 SURBL list(s) hit (4+ = 3)
       0  5 SURBL list(s) hit (5+ = 0)
       0  6 SURBL list(s) hit (6+ = 0)
 
 
  So from these results, the FP rate is very low for SURBL (0.21%), and
  while there is a ton of overlap for spam (57.3%), there's very little
  for ham (0.01%).
 
 
 Thank you for the data.  They seem to support what we've been saying.
 
 At a count of 138407, messages that hit only 1 SURBL list are a
 sizable group (about 18% of the spam that hit SURBL), so sharply
 lowering the score of a single-list hit may result in a significant
 number of FNs.

But maybe we need a scoring scheme like this:
- the current SURBL score if the message is on only that one list
- if on list1 and list2, then not a score of list1+list2 but rather a
  basic SURBL score + a fixed value
- if on list1, list2 and list3, then not a score of list1+list2+list3
  but rather a basic SURBL score + 2*(fixed value)
21% of all the SURBL-hitting spam hits 4 or more lists. If such a hit
were a FP (not very likely, but possible), the summed score would be too
high to compensate for. If we use a scoring rule like the one above, the
score of a message hitting 4 lists would be, e.g.:
basic SURBL score = 3
fixed value = 1, so 3*(fixed value) = 3
score = 3 + 3 = 6
And maybe for a SURBL list with a very low FP rate the fixed value
could be set higher.
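
The scheme described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not SpamAssassin configuration: the function name capped_surbl_score and its parameters are hypothetical, and the defaults (base score 3, fixed value 1) are simply the example values that give a score of 6 for four lists.

```python
def capped_surbl_score(lists_hit, base_score=3.0, fixed_value=1.0):
    """Score for a message listed on `lists_hit` SURBL lists.

    Hypothetical helper: the first list contributes the full base score,
    and every additional list adds only a fixed increment, instead of
    summing the individual per-list scores.
    """
    if lists_hit <= 0:
        return 0.0  # no SURBL hit, no SURBL score
    return base_score + (lists_hit - 1) * fixed_value


# Show how the score grows with the number of lists hit.
for n in range(1, 7):
    print(f"{n} list(s) hit -> score {capped_surbl_score(n)}")
```

With this rule the score grows by the fixed value per extra list, so a rare multi-list FP is penalized less heavily than it would be by summing per-list scores.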

Maurice Lucas





Re: perceptron and over-scoring (Re: Over-scoring of SURBL lists... )

2006-02-20 Thread Theo Van Dinter
On Mon, Feb 20, 2006 at 07:38:42PM +, Justin Mason wrote:
 yes, I'm a little worried about that, too.

Just for some info...  I went through the set1 spam logs for 3.1 score
generation.

1112804 total messages
 776108 messages hit SURBL
 138407 1 SURBL list(s) hit (1+ = 776108)
 189795 2 SURBL list(s) hit (2+ = 637701)
 281255 3 SURBL list(s) hit (3+ = 447906)
 136964 4 SURBL list(s) hit (4+ = 166651)
  29685 5 SURBL list(s) hit (5+ = 29687)
      2 6 SURBL list(s) hit (6+ = 2)

The set1 ham logs:

477629  total messages
  1023  messages hit SURBL
   992  1 SURBL list(s) hit (1+ = 1023)
    23  2 SURBL list(s) hit (2+ = 31)
     5  3 SURBL list(s) hit (3+ = 8)
     3  4 SURBL list(s) hit (4+ = 3)
     0  5 SURBL list(s) hit (5+ = 0)
     0  6 SURBL list(s) hit (6+ = 0)


So from these results, the FP rate is very low for SURBL (0.21%), and
while there is a ton of overlap for spam (57.3%), there's very little
for ham (0.01%).
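
For anyone who wants to check the percentages, the following Python sketch recomputes them from the counts listed above. The variable and function names are made up for illustration; "overlap" is taken here to mean the share of all messages that hit 2 or more SURBL lists, which is an assumption about how the figures were derived.

```python
# Messages per number of SURBL lists hit, copied from the set1 logs above.
spam_total = 1112804
ham_total = 477629
spam_by_lists = {1: 138407, 2: 189795, 3: 281255, 4: 136964, 5: 29685, 6: 2}
ham_by_lists = {1: 992, 2: 23, 3: 5, 4: 3, 5: 0, 6: 0}


def at_least(counts, k):
    """Number of messages that hit k or more lists (the "k+" column)."""
    return sum(c for n, c in counts.items() if n >= k)


def pct(part, whole):
    """Percentage of `whole` represented by `part`."""
    return 100.0 * part / whole


fp_rate = pct(at_least(ham_by_lists, 1), ham_total)        # ham hitting any list
spam_overlap = pct(at_least(spam_by_lists, 2), spam_total)  # spam on 2+ lists
ham_overlap = pct(at_least(ham_by_lists, 2), ham_total)     # ham on 2+ lists

print(f"FP rate:      {fp_rate:.2f}%")
print(f"spam overlap: {spam_overlap:.1f}%")
print(f"ham overlap:  {ham_overlap:.2f}%")
```

Under that reading of "overlap", the three printed values round to the figures quoted in the summary.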

-- 
Randomly Generated Tagline:
Winny and I lived in a house that ran on static electricity...
 If you wanted to run the blender, you had to rub balloons on your
 head... if you wanted to cook, you had to pull off a sweater real quick...
-- Steven Wright




Re: perceptron and over-scoring (Re: Over-scoring of SURBL lists... )

2006-02-20 Thread Justin Mason

Theo Van Dinter writes:
 On Mon, Feb 20, 2006 at 07:38:42PM +, Justin Mason wrote:
  yes, I'm a little worried about that, too.
 So from these results, the FP rate is very low for SURBL (0.21%), and
 while there is a ton of overlap for spam (57.3%), there's very little
 for ham (0.01%).

aha, that's very interesting!

--j.


Re: perceptron and over-scoring (Re: Over-scoring of SURBL lists... )

2006-02-20 Thread Raymond Dijkxhoorn

Hi!


  On Mon, Feb 20, 2006 at 07:38:42PM +, Justin Mason wrote:
   yes, I'm a little worried about that, too.

  So from these results, the FP rate is very low for SURBL (0.21%), and
  while there is a ton of overlap for spam (57.3%), there's very little
  for ham (0.01%).

 aha, that's very interesting!


And no surprise: we have been discussing this internally as well, and we
really see very few FP reports that overlap across lists.


And please, if you DO get a FP, _report_ it! ([EMAIL PROTECTED])

Thanks!
Raymond.