RE: can any body help me understand this

2004-12-16 Thread Chris Santerre


>-Original Message-
>From: Matt Kettler [mailto:[EMAIL PROTECTED]
>Sent: Thursday, December 16, 2004 3:32 PM
>To: Chris Santerre; users@spamassassin.apache.org
>Subject: RE: can any body help me understand this
>
>
>At 02:55 PM 12/16/2004, Chris Santerre wrote:
>>YAMPTSBOTW!
>>
>>Yet Another Matt Post That Should Be On The Wiki!!
>
>Not really... as of SA 3.0 you can't read the tokens anymore.. they are
>SHA1 hashed before dumping into the database. Higher privacy, and faster
>searching, but less useful for debugging.

OK, stop it. Now you're just showing off ;)

--Chris (Goes to google for a refresh on SHA1..)


Re: can any body help me understand this

2004-12-16 Thread Justin Mason
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1


Chris Santerre writes:
> >As for the dump output..
> 
> YAMPTSBOTW!
> 
> Yet Another Matt Post That Should Be On The Wiki!!
> 
> ;)

I know, the man's a one-man FAQ ;)

- --j.
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFBweuUMJF5cimLx9ARAgqCAJ916QoWcM4y6jVi+Iu3WpBGay8bTwCdHy1m
L7R4q45MOt83XMP5nn5p5h0=
=EzS6
-END PGP SIGNATURE-



RE: can any body help me understand this

2004-12-16 Thread Matt Kettler
At 02:55 PM 12/16/2004, Chris Santerre wrote:
>YAMPTSBOTW!
>
>Yet Another Matt Post That Should Be On The Wiki!!

Not really... as of SA 3.0 you can't read the tokens anymore.. they are
SHA1 hashed before dumping into the database. Higher privacy, and faster
searching, but less useful for debugging.



RE: can any body help me understand this

2004-12-16 Thread Chris Santerre


>-Original Message-
>From: Matt Kettler [mailto:[EMAIL PROTECTED]
>Sent: Thursday, December 16, 2004 10:33 AM
>To: Rakesh; users@spamassassin.apache.org
>Subject: Re: can any body help me understand this
>
>
>At 04:29 PM 12/16/2004 +0530, Rakesh wrote:
>> did a sa-learn --dump data and got an output of the following kind. Can
>> any one please help me understand the output.
>
>
>The dump output is pretty simple.. The token format gets a bit complicated,
>but even that isn't too bad.
>
>As for the dump output..

YAMPTSBOTW!

Yet Another Matt Post That Should Be On The Wiki!!

;)

--Chris 


Re: can any body help me understand this

2004-12-16 Thread snowjack
Kang, Joseph S. wrote:
>> As for the dump output..
>> 0.000  0108 1103190407  N:H*i:sk:NNfNNNc
>
> [snipped for brevity]
>
>> The fourth is the token itself. SA uses some "prefix" characters for
>> encoding things, but without any prefix, a token is a word in the body
>> of the message.
>
> I think you meant the FIFTH column is the token itself, right?
>
> -Joe K.

Agreed, I think the fourth field is the timestamp at which the token was
last seen in a message, and the fifth field is the token. The
timestamp's used for auto-expiry runs where tokens that haven't been
seen in a while are removed from the db if other expiration requirements
are met.
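
To make the five-field layout concrete, here is a minimal Python sketch that
splits one dump line into its fields. The names DumpEntry and parse_dump_line
are made up for illustration, and reading the fourth field as Unix epoch
seconds (shown here in UTC) is an assumption based on the values in this
thread, not something taken from the SA source.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class DumpEntry(NamedTuple):
    probability: float   # per-token spam probability (first field)
    spam_count: int      # times trained on spam containing the token
    ham_count: int       # times trained on nonspam containing the token
    last_seen: datetime  # fourth field, assumed to be Unix epoch seconds
    token: str           # fifth field, the token itself


def parse_dump_line(line: str) -> DumpEntry:
    # Split on runs of whitespace at most four times, so the token
    # (the last field) stays in one piece.
    prob, nspam, nham, atime, token = line.split(None, 4)
    return DumpEntry(
        float(prob),
        int(nspam),
        int(nham),
        datetime.fromtimestamp(int(atime), tz=timezone.utc),
        token.strip(),
    )


entry = parse_dump_line("0.978  2  0 1103188668  UNLIKE")
print(entry.last_seen.isoformat(), entry.token)
# prints: 2004-12-16T09:17:48+00:00 UNLIKE
```

The sample line is one of the well-formed entries from the original dump; a
last-seen time on the day of the post is what you would expect for a token
that is still turning up in mail.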




RE: can any body help me understand this

2004-12-16 Thread Kang, Joseph S.
> As for the dump output..
> >0.000  0108 1103190407  N:H*i:sk:NNfNNNc
> 
[snipped for brevity]
> The fourth is the token itself. SA uses some "prefix" characters for
> encoding things, but without any prefix, a token is a word in the body
> of the message.

I think you meant the FIFTH column is the token itself, right?

-Joe K.


Re: can any body help me understand this

2004-12-16 Thread Matt Kettler
At 04:29 PM 12/16/2004 +0530, Rakesh wrote:
> did a sa-learn --dump data and got an output of the following kind. Can
> any one please help me understand the output.

The dump output is pretty simple.. The token format gets a bit complicated, 
but even that isn't too bad.

As for the dump output..
0.000  0108 1103190407  N:H*i:sk:NNfNNNc

The first column is a spam probability.. 0.999 means 99.9% probability of 
this token appearing in spam, 0.000 means 0 percent. (note: this is just 
probability for ONE token. SA does a chi-squared combine of these numbers 
to figure out the overall probability of the whole message)
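
For a feel of what a chi-squared combine does with per-token numbers, here is
a rough Python sketch in the general style of Fisher's method as used by
Bayesian spam filters. It is not taken from the SA source; the function names
chi2q and combine, and the clamping value eps, are assumptions made for this
example.

```python
import math


def chi2q(x2: float, df: int) -> float:
    """Chance that a chi-squared variable with even `df` exceeds x2."""
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def combine(probs, eps=1e-6):
    # Clamp so that log(0) never occurs for dump values such as 0.000.
    probs = [min(max(p, eps), 1.0 - eps) for p in probs]
    n = len(probs)
    # Evidence for spam: how unlikely it is that all (1 - p) are this small.
    spamminess = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    # Evidence for nonspam: the same test applied to the p values themselves.
    hamminess = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    # 0.5 means the two kinds of evidence cancel out.
    return (spamminess - hamminess + 1.0) / 2.0


print(combine([0.978, 0.958, 0.985]))  # several spammy tokens: close to 1
print(combine([0.9, 0.1]))             # conflicting tokens: 0.5
```

The point of combining this way is that a handful of strongly spammy tokens
and a handful of strongly nonspam tokens pull the result toward the middle
rather than letting one side win outright.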

The second column is the number of times bayes has been trained on a spam 
message containing the token.

The third column is the number of times bayes has been trained on a nonspam 
message containing the token.

The fourth is the token itself. SA uses some "prefix" characters for 
encoding things, but without any prefix, a token is a word in the body of 
the message.

Now this leads into how do all these prefixes work and what do they mean...
First, some token format prefixes:
-
N: means there are numbers in the token represented by N's (thus allowing 
match of anything 0-9)
sk: means "skip", i.e. the token can have other characters leading up to it,
and does not need to be the start of a "word"

Now some "where the token must appear" prefixes:
-
U* indicates it's the username part of an email address
D* indicates it's the domain part of an email address
H indicates the token must be in a header. It can be followed by a
literal header name (e.g. HTo:), or one of the following "short cuts"

%HEADER_NAME_COMPRESSION = (
  'Message-Id'               => '*m',
  'Message-ID'               => '*M',
  'Received'                 => '*r',
  'User-Agent'               => '*u',
  'References'               => '*f',
  'In-Reply-To'              => '*i',
  'From'                     => '*F',
  'Reply-To'                 => '*R',
  'Return-Path'              => '*p',
  'X-Mailer'                 => '*x',
  'X-Authentication-Warning' => '*a',
  'Organization'             => '*o',
  'Organisation'             => '*o',
  'Content-Type'             => '*c',
);

So let's translate a few.

0.000  0108 1103190407  N:H*i:sk:NNfNNNc
Probability 0%, if the "In-Reply-To" header contains a numeric
pattern NNfNNNc (i.e. 00f000c through 99f999c). The token may be in the
middle of a "word" and does not need to have whitespace or a word boundary
before it.

0.978  2  0 1103188668  UNLIKE
97.8% chance of the all-caps word "UNLIKE" appearing in the body of spam.

0.009  0  6 1102997003  U*sambalpur
0.9% chance of spam if there is an email address with the username 
"sambalpur@" in the message

0.958  1  0 1103003309  H*M:OEBfa62
95.8% chance of spam if the Message-ID header has a word starting with 
OEBfa62

0.958  1  0 1103171817  Tins
95.8% chance of spam if the word "Tins" appears in the body.

etc...

0.049  0  1 1102985500  D*ms52.hinet.net
0.013219  25539 1103193138  H*r:Unix
0.027 31   1717 1103192325  N:HX-Qmail-Scanner:N.NN
0.467123219 1103186329  PERSONAL
0.013  0  4 1103027319  HTo:U*Jesrine
0.985  3  0 1103099578  backfiring
0.017  0  3 1103031379  YÓk
0.049  0  1 1102972766  Wspecial
0.958  1  0 1102981540  sk:QHKBAZC
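
The prefix rules above can be turned into a small decoder. This is only a
Python sketch of the scheme as described in this post, not SA's own
tokenizer: decode_token and the keys of the dict it returns are invented for
the example, and the header check is a heuristic that would misread a body
word starting with "H" and containing a colon.

```python
# Inverse of the %HEADER_NAME_COMPRESSION table quoted above.
HEADER_SHORTCUTS = {
    "*m": "Message-Id", "*M": "Message-ID", "*r": "Received",
    "*u": "User-Agent", "*f": "References", "*i": "In-Reply-To",
    "*F": "From", "*R": "Reply-To", "*p": "Return-Path",
    "*x": "X-Mailer", "*a": "X-Authentication-Warning",
    "*o": "Organization",  # 'Organisation' shares the same shortcut
    "*c": "Content-Type",
}


def decode_token(token: str) -> dict:
    info = {"numeric": False, "header": None, "skip": False,
            "kind": "word", "text": token}
    rest = token
    if rest.startswith("N:"):            # digits were replaced by N's
        info["numeric"] = True
        rest = rest[2:]
    if rest.startswith("H") and ":" in rest:   # header token (heuristic)
        name, rest = rest[1:].split(":", 1)
        # Expand a shortcut such as '*i'; otherwise keep the literal name.
        info["header"] = HEADER_SHORTCUTS.get(name, name)
    if rest.startswith("sk:"):           # "skip" prefix
        info["skip"] = True
        rest = rest[3:]
    if rest.startswith("U*"):            # username part of an address
        info["kind"], rest = "username", rest[2:]
    elif rest.startswith("D*"):          # domain part of an address
        info["kind"], rest = "domain", rest[2:]
    info["text"] = rest
    return info


for tok in ("N:H*i:sk:NNfNNNc", "HTo:U*Jesrine", "D*ms52.hinet.net"):
    print(tok, "->", decode_token(tok))
```

Run against the untranslated entries in the list above, "HTo:U*Jesrine" comes
out as a username "Jesrine" seen in the To header, and "D*ms52.hinet.net" as
an address domain with no header restriction.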



Re: can any body help me understand this

2004-12-16 Thread Robert Menschel

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Hello Rakesh,

Thursday, December 16, 2004, 2:59:52 AM, you wrote:

R> These days my bayesian engine is giving me a lots of false
R> positive, ...
R> However now i am trying to investigate whether my Bayes is really
R> poisoned or not. ... 

R> Also if my Bayes is poisoned can i safely replace the existing
R> bayes db of this server with one of my another server as right now
R> spams over there are being properly trapped.

IMO the results are what matters, not the statistics. If you're
getting any significant number of spam with BAYES_00 or any
significant number of non-spam with BAYES_99, then yes, your Bayes
database is poisoned.

Yes, if you can quiesce that other server's SA system and copy the
quiet (not being updated) Bayes to this server (assuming same version,
same software pre-reqs satisfied).

Though I have three systems and could do this also, I've never
bothered. On those (rare) occasions when I've had to wipe a Bayes
database (generally because I've poisoned the database myself due to
EBKAC error), I've simply deleted the files and then done a manual
sa-learn with 1k each of known and verified recent spam and non-spam.

Bob Menschel

-BEGIN PGP SIGNATURE-
Version: PGP 8.0

iQA/AwUBQcGov5ebK8E4qh1HEQIPXwCgqCeEdNUB+liz63W2aTgZFSh/FIUAn24x
QEZPa/buHiUE+LiMPP/g8Mie
=EYQN
-END PGP SIGNATURE-