Re: [Dspam-user] train method

Stevan Bajić Wed, 16 Dec 2009 17:01:44 -0800

On Wed, 16 Dec 2009 16:53:16 -0600
Kenneth Marshall <[email protected]> wrote:

> On Thu, Dec 17, 2009 at 12:40:27AM +0200, Ibrahim Harrani wrote:
> > Hi,
> > 
> > I have 5000 ham and 15 000 spam mails. I would like to train it before using
> > dspam.
> > I usually train spam mail first then ham mail with the following commands.
> > Is it safe to train dspam like this.
> > or do I have to use dspam_train script?
> > If I remember correctly i used dspam_train but the script was learning spam
> > mails as ham because of not enough pattern?
> > 
> > 
> > for spam:
> > for i in `/usr/bin/find /home/spam/ -type f`
> >         do
> >                 echo $i; sudo /usr/local/bin/dspam --client --user \
> >                     myglobaluser --class=spam --source=corpus --mode=teft < $i
> > 
> >       done
> > 
> > For ham mails:
> > 
> > for i in `/usr/bin/find /home/ham/ -type f`
> >         do
> >                 echo $i; sudo /usr/local/bin/dspam --client --user \
> >                     myglobaluser --class=ham --source=corpus --mode=teft < $i
> > 
> >       done
> > 
> > PS: I started testing dspam 3.9RC2 on FreeBSD with PostgreSQL driver
> > support.
> 
Hello Ken,

> I just took a look at the dspam 3.9RC2 tools.pgsql schema definition
> and it is curiously lacking an index on token data. I would recommend
> something with (uid,token). Also adjust the fillfactor to allow HOT
> updates to the table. Finally a CLUSTER using the uid,token index
> should help locality of reference.
> 
I am not a strong PostgreSQL user, so I don't know what should be added and
what not. If I am not wrong, the "UNIQUE (uid, token)" constraint already
adds an index to dspam_token_data, so there is no need to add the same
index again. The only thing that one could add/change is the CLUSTER and/or
FILLFACTOR. If I am not wrong, the default FILLFACTOR for btree indices is 90.
Should that fillfactor be lower or higher for DSPAM? What would you suggest?
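
To confirm that the UNIQUE constraint really produced an index, the system
catalog can be queried. This is only a sketch; it assumes the table lives in
the default search path and uses the standard pg_indexes view:

```sql
-- List every index PostgreSQL currently maintains on dspam_token_data.
-- The implicit index behind UNIQUE (uid, token) should appear here.
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'dspam_token_data';
```

If the output shows a unique btree index on (uid, token), no additional
index on those columns should be needed.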

I think that most users will not benefit from a low index FILLFACTOR, at least
not if we only index uid and token. Most setups will have a huge insert rate
for new (uid, token) combinations at the beginning, but later the changes to
those two values will be fairly low. The only operations that change them are
the insertion of new tokens for a uid and the purge script removing old
tokens. All other actions modify spam_hits, innocent_hits and last_hit, and
those changes would not benefit from the default index anyway. So FILLFACTOR
has no effect on those updates.
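
For reference, PostgreSQL also has a separate fillfactor for the table itself,
and that is the setting relevant to the HOT updates Ken mentions, since HOT
applies to updates that touch only non-indexed columns such as spam_hits,
innocent_hits and last_hit. A minimal sketch, assuming PostgreSQL 8.3 or newer
(HOT does not exist before that); the value 80 is only a placeholder, not a
tested recommendation:

```sql
-- Leave roughly 20% of each table page free so that an updated row version
-- can be written to the same page as the old one.
ALTER TABLE dspam_token_data SET (fillfactor = 80);

-- Existing pages keep their old packing. The new setting applies to pages
-- written from now on, or to all pages once the table is rewritten
-- (for example by CLUSTER).
```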

On my setup I use a merged group, and this is the user getting the most changes
(because I train that user daily from my honeypot). Normal users rarely get
new tokens. For me a FILLFACTOR of 90, maybe even higher, is okay. I don't know
what other setups out there look like. Maybe they would benefit from a lower
FILLFACTOR? Should I add a default of 80 for DSPAM? What do you think?
--------------
ALTER INDEX dspam_token_data_uid_key SET (fillfactor = 80);
REINDEX INDEX dspam_token_data_uid_key;
--------------

Do you suggest setting CLUSTER by default for the index? Should I add the
following to the PostgreSQL schema:
--------------
ALTER TABLE dspam_token_data CLUSTER ON dspam_token_data_uid_key;
--------------
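
Note that ALTER TABLE ... CLUSTER ON only records which index later CLUSTER
runs should use; it does not reorder any rows by itself. A sketch of the full
sequence, assuming the index name shown above and a maintenance window
(CLUSTER takes an exclusive lock on the table while it rewrites it):

```sql
-- Remember the index to cluster on.
ALTER TABLE dspam_token_data CLUSTER ON dspam_token_data_uid_key;

-- Physically rewrite the table in index order, using the remembered index.
CLUSTER dspam_token_data;

-- Refresh planner statistics after the rewrite.
ANALYZE dspam_token_data;
```

The ordering is not maintained for rows inserted or updated afterwards, so
CLUSTER would have to be repeated periodically, for example after the nightly
purge.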

btw: Why so late with suggestions? It would have been nice to get those changes
much earlier. We never heard anything from the DSPAM users using PostgreSQL,
and our bug/feature tracker was and is always open for such things.

btw2: If I am not wrong, CLUSTER is available for PostgreSQL >= 8.0. Right?
And since when has FILLFACTOR been available? I ask because we may need to put
those commands into a separate SQL file for PostgreSQL users with newer
versions instead of adding them to the default SQL schema.

btw3: I purge tokens daily and force a reindex of the tables every night
after purging. I don't know how others maintain their DSPAM. But in general
I think we should not mess around too much with CLUSTER/FILLFACTOR. A good
DBA will tweak his/her PostgreSQL instance to suit his/her needs anyway.
Maybe I am wrong and we should set some sane defaults for those who use
PostgreSQL without knowing much about it?

> Regards,
> Ken
> 
-- 
Kind Regards from Switzerland,

Stevan Bajić

_______________________________________________
Dspam-user mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dspam-user
