Re: stats

2004-11-29 Thread Rupa Schomaker


On 11/29/2004 8:03 AM, Ronan wrote:
> just out of curiosity are there any other packages out there that could
> be useful? I'm using exim + exiscan w/ SA

I use the following:

Package: amavis-stats
Description: generate rrd statistics from amavis* log
 amavis-stats is a simple amavis statistics generator based on rrdtool.
 Infection statistics are produced from amavis (sys)log entries and
 stored in rrd databases.

 The RRD files are created and updated by a perl script. Graphs are
 generated by a php script. Requires either rrdtool or php4-rrdtool.

You can see my home system's stats at:

http://www.rupa.com/amavis-stats/

Since I turned on greylisting the amount of spam/viruses actually
scanned has gone WAY down.  Hrrm.. Looks like on some historical graphs
the count is slammed to max...  dunno why.

and for postfix (not really appropriate for you):

Package: mailgraph
Description: Mail statistics RRDtool frontend for Postfix
 Mailgraph is a very simple mail statistics RRDtool frontend for Postfix
 that produces daily, weekly, monthly and yearly graphs of received/sent
 and bounced/rejected mail.

http://www.rupa.com/cgi-bin/mailgraph.cgi



> ronan

-- 
 -Rupa



Re: more spamassassin + bayes + postgres stuff

2004-11-19 Thread Rupa Schomaker


On 11/18/2004 3:38 PM, Michael Parker wrote:
> On Thu, Nov 18, 2004 at 06:53:19AM -0800, Rupa Schomaker wrote:
>
> > Some questions:
> >
> > Is bytea really necessary?  If I follow the path of the patch, the bytea
> > change was done prior to adding the index.  Since the tokens are binary
> > data it is probably more correct though, especially if one has an
> > encoding other than SQL_ASCII set for the DB...
>
> Yes, as far as I can tell from the documentation.  The fact that we're
> storing the binary value makes it necessary.  If I'm misinformed, then
> feel free to point out where in the documentation.

My understanding is that bytea isn't strictly necessary, but doing without
it is more fragile (subject to the database encoding and the client
encoding).  This was discussed recently on one of the postgres groups...
Looking:

http://groups.google.com/groups?hl=en&lr=&selm=cndnbc%24otp%241%40FreeBSD.csie.NCTU.edu.tw
Message-ID: [EMAIL PROTECTED]

===
From: Tom Lane ([EMAIL PROTECTED])
Subject: Re: [ADMIN] evil characters #bfef cause dump failure
Date: 2004-11-16 12:19:06 PST

[snip]
BTW, SQL_ASCII is not so much an encoding as the absence of any encoding
choice; it just passes 8-bit data with no interpretation.  So it's not
*that* unreasonable a default.  You can store UTF8 data in it without
any problem, you just won't have the niceties like detection of bad
character sequences.

   regards, tom lane
===

Leave it as bytea...
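
To illustrate why raw binary tokens are fragile in a text column, here is
a minimal Python sketch.  It assumes, purely for illustration, that a token
is the last 5 bytes of a SHA1 digest; binary_token() and count_invalid()
are hypothetical helpers, not SpamAssassin code.  Any byte string that is
not valid in the database encoding (UTF-8 here) would be rejected or
mangled by a text type, while bytea stores it untouched.

```python
import hashlib


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def binary_token(word: str) -> bytes:
    # Assumption for illustration only: a token is the last 5 bytes
    # of the SHA1 digest of the word.
    return hashlib.sha1(word.encode("utf-8")).digest()[-5:]


def count_invalid(words) -> int:
    """Count tokens that a UTF-8 encoded text column could not hold."""
    return sum(1 for w in words if not is_valid_utf8(binary_token(w)))


if __name__ == "__main__":
    words = ["viagra", "mortgage", "hello", "meeting", "invoice"]
    print(count_invalid(words), "of", len(words), "tokens are not valid UTF-8")
```

The byte pair from the quoted subject line, b"\xbf\xef", is one such
sequence: it starts with a continuation byte, so it cannot be decoded as
UTF-8.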

> > What do you use to benchmark changes?  I'm willing to experiment but
> > would like to have some reproducible results for ya...
>
> It's not really ready for real world consumption and time has been
> short for getting it ready.  You can read a little about it here:
> http://wiki.apache.org/spamassassin/BayesBenchmark
>
> Hopefully, I'll get some free time soon and get it into the SA tree.

I'll take a look at it when I get a chance.

Some more testing/observations with sa-learn only.  BTW: do you want me
to move this discussion to the ticket in bugzilla?  Or we can wait 'till
I/we have a summary...

General notes:

1) Why not a unique index that mimics the primary key (though do it in
token,id order not id,token)?  Won't matter in my case (since I run as
one user) and probably doesn't matter at all unless running with lots 'n
lots of users...

2) bayes_seen.msgid should be type 'text' -- sa-learn (and others) don't
truncate to 200.

3) I also get differences in the backup file.

-rw-r--r--  1 rupa users 13047214 Nov 18 13:23 backup_dbm.txt
-rw-r--r--  1 rupa users 13047202 Nov 18 17:16 backup_new.txt

An actual diff is probably meaningless since I doubt order is guaranteed
between a dbm and sql.  I did the diff and quickly gave up.  I suppose
the data could be ordered from both sources and then compared?
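
One way to do the ordered comparison, as a rough Python sketch: treat each
file as a multiset of lines so that order no longer matters, then report
the lines that appear on only one side.  compare_unordered() is a
hypothetical helper; it reads in binary mode because the tokens may not be
valid text, and it assumes both files fit in memory (fine for ~13 MB).

```python
from collections import Counter


def compare_unordered(path_a, path_b):
    """Compare two files as multisets of lines, ignoring line order.

    Returns (only_in_a, only_in_b), each a sorted list of byte strings.
    """
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        lines_a = Counter(fa.read().splitlines())
        lines_b = Counter(fb.read().splitlines())
    # Counter subtraction keeps only positive counts, so duplicated
    # lines are handled correctly.
    only_a = lines_a - lines_b
    only_b = lines_b - lines_a
    return sorted(only_a.elements()), sorted(only_b.elements())
```

Usage would be something like
compare_unordered("backup_dbm.txt", "backup_new.txt"); two empty lists
would mean the files differ only in ordering.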

Some 'benchmarks' of sa-learn.  Single run:

bayes_seen: 202863 rows
bayes_token: 150842 rows

System is:
model name  : AMD Athlon(tm) XP 2600+
MemTotal:  1031916 kB
debian unstable

With a fairly large workload from a memory standpoint but CPU generally
fairly idle.

Postgres hasn't been tuned much -- have to reset the stats in postgres
and do some analysis...

1) Shipped config with msgid='text' on my backup file:

real    24m35.663s

2) Shipped config with indices added:

real    32m33.931s

Ekk!  Analyze; delete; rerun:

Still 30min.

hrmmm..

But I know it runs better in normal operation.  Oh well *shrug* must be
the index update even though the check constraint doesn't need a table scan.

3) Patch (2004-10-31 18:53) applied, re-create tables:

real    14m29.793s

Analyze, delete, rerun:

15m.

A bit better.

BTW: Using dbm the full restore takes 23s...

Time to add some small amount of stats to sa-learn (or underlying) to
see where we're spending time...  Added some more timing points and
dbg() output to SQL.pm.  Needs Time::HiRes which is bundled in perl
5.8.x but is an optional add-on for earlier stuff.
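
The actual timing points were added in Perl and are not shown here.  As a
Python analogue of the idea, a small helper can record the elapsed
wall-clock time for every 1000 operations.  BatchTimer and its parameters
are hypothetical names for illustration; the clock is injectable so the
helper can be checked without real delays.

```python
import time


class BatchTimer:
    """Record elapsed wall-clock time for every `batch_size` operations."""

    def __init__(self, batch_size=1000, clock=time.perf_counter):
        self.batch_size = batch_size
        self.clock = clock
        self.count = 0
        self.batch_start = clock()
        self.batches = []  # elapsed seconds per completed batch

    def tick(self):
        """Call once per operation (for example, once per insert)."""
        self.count += 1
        if self.count % self.batch_size == 0:
            now = self.clock()
            self.batches.append(now - self.batch_start)
            self.batch_start = now
```

Calling tick() after each insert and printing batches at the end gives
per-1000 figures of the kind reported below.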

Ok, with my large set:

Token inserts start at around 1-2s per 1000 and rise to 7-8s per 1000.

Seen inserts start at around 1s per 1000 and stay there.

I can think of ways to optimize sa-learn (do it all in one TX rather
than one TX per insert, or assume an insert rather than using the generic
query-then-insert path for _put_token()), but the restore is only done
once anyway and the changes would be invasive rather than just re-using
existing logic.  Not worth it.
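
For reference, the difference between one TX per insert and a single TX
for the whole restore looks roughly like this.  It is a sketch only: it
uses Python's sqlite3 with an in-memory database as a stand-in for
postgres, and the column list is a simplified, assumed version of
bayes_token.  With an in-memory database there are no commit flushes, so
it shows the structure rather than the speed difference.

```python
import sqlite3

INSERT_SQL = "INSERT INTO bayes_token VALUES (?, ?, ?, ?)"


def make_db():
    # isolation_level=None puts sqlite3 in autocommit mode, so every
    # statement outside an explicit BEGIN is its own transaction.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute(
        "CREATE TABLE bayes_token ("
        " id INTEGER NOT NULL, token BLOB NOT NULL,"
        " spam_count INTEGER NOT NULL, ham_count INTEGER NOT NULL,"
        " PRIMARY KEY (id, token))"
    )
    return conn


def insert_one_tx_each(conn, rows):
    """One implicit transaction per row (the slow pattern)."""
    for row in rows:
        conn.execute(INSERT_SQL, row)


def insert_single_tx(conn, rows):
    """All rows in one explicit transaction; all-or-nothing on error."""
    conn.execute("BEGIN")
    try:
        conn.executemany(INSERT_SQL, rows)
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
```

A side effect of the single-TX form is that a failure part way through
leaves the table unchanged instead of half loaded.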

It is however a reasonable test of the insert/update logic of learning a
single message (whether auto-learn or manual).  Doesn't test the query
side though...

 
> Michael

-- 
 -Rupa




sa-learn --import with postgresql

2004-11-18 Thread Rupa Schomaker

Ran into a problem importing into postgres.  Running with -D didn't help
other than pinpointing to a problem while importing msgids.  postgres
logs showed:

2004-11-17 16:20:41 [14205] ERROR:  value too long for type character
varying(200)

The bayes_seen table has msgid as a varchar(200).

Changing it to 'test' fixed it for me.  Either spamassassin should
truncate or the underlying datatype should be larger or the error should
be handled better.  (import failed and deleted everything)
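
As a sketch of the "truncate or check first" options, the following Python
helpers separate msgids that fit a 200-character column from those that
would be rejected, and show an explicit truncation.  split_by_length() and
truncate_msgid() are hypothetical names, not part of spamassassin; the
input is assumed to be an iterable of msgid strings already extracted from
whatever is being imported.

```python
MSGID_LIMIT = 200  # width of the varchar(200) column in the error above


def split_by_length(msgids, limit=MSGID_LIMIT):
    """Separate msgids that fit the column from those that are too long."""
    fits, too_long = [], []
    for msgid in msgids:
        (too_long if len(msgid) > limit else fits).append(msgid)
    return fits, too_long


def truncate_msgid(msgid, limit=MSGID_LIMIT):
    """Truncate explicitly.

    Note that two distinct long msgids can become identical after
    truncation, which matters if the column is part of a key.
    """
    return msgid[:limit]
```

Running split_by_length() before an import would at least show how many
rows are affected without losing the whole import.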

I didn't check behavior for learning a single message with a long msgid.

-- 
 -Rupa




Re: sa-learn --import with postgresql

2004-11-18 Thread Rupa Schomaker



Rupa Schomaker wrote:
[snip]

> 2004-11-17 16:20:41 [14205] ERROR:  value too long for type character
> varying(200)
>
> The bayes_seen table has msgid as a varchar(200).
>
> Changing it to 'test' fixed it for me.  Either spamassassin should

err, changing to type 'text' fixed it for me.

[snip]

After the import, there were 5 rows with msgid > 200:

4 like:
%RNDDIGIT36%RNDLCCHAR13%RNDDIGIT13%RNDLCCHAR13...

1 like:
jughvuuvygvi5zRhsptNPX[lots of [EMAIL PROTECTED] of spaces]hotmail.com

Note that mysql truncates long values silently so is not affected by
this.  Other databases most probably behave like postgres.

-- 
 -Rupa
