Re: [Dspam-user] train method

Kenneth Marshall Wed, 16 Dec 2009 19:51:31 -0800

On Thu, Dec 17, 2009 at 01:53:05AM +0100, Stevan Bajić wrote:
> On Wed, 16 Dec 2009 16:53:16 -0600
> Kenneth Marshall <[email protected]> wrote:
> 
> > On Thu, Dec 17, 2009 at 12:40:27AM +0200, Ibrahim Harrani wrote:
> > > Hi,
> > > 
> > > > I have 5000 ham and 15 000 spam mails. I would like to train dspam on
> > > > them before using it.
> > > > I usually train spam mail first, then ham mail, with the following
> > > > commands. Is it safe to train dspam like this, or do I have to use the
> > > > dspam_train script?
> > > > If I remember correctly, I used dspam_train but the script was learning
> > > > spam mails as ham because there were not enough patterns?
> > > 
> > > 
> > > for spam:
> > > for i in `/usr/bin/find /home/spam/ -type f`
> > >         do
> > >                 echo $i; sudo /usr/local/bin/dspam --client --user
> > > myglobaluser --class=spam --source=corpus --mode=teft < $i
> > > 
> > >       done
> > > 
> > > For ham mails:
> > > 
> > > for i in `/usr/bin/find /home/ham/ -type f`
> > >         do
> > >                 echo $i; sudo /usr/local/bin/dspam --client --user
> > > myglobaluser --class=ham--source=corpus --mode=teft < $i
> > > 
> > >       done
> > > 
> > > PS: I started testing dspam 3.9RC2 on FreeBSD with PostgreSQL driver
> > > support.
> > 
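
The backtick `find` loops quoted above split file names on whitespace, so a message file with a space in its name would be passed to dspam in pieces. A minimal sketch of a more robust loop follows, assuming the same dspam flags as in the quoted commands. The function name `train_dir` and the `DSPAM` and `DSPAM_USER` variables are hypothetical names introduced only for this illustration, and `sudo` is left out on the assumption that the script runs as a user allowed to call dspam.

```shell
#!/bin/sh
# Placeholders (assumptions): override them in the environment as needed.
DSPAM=${DSPAM:-/usr/local/bin/dspam}
DSPAM_USER=${DSPAM_USER:-myglobaluser}

# train_dir CLASS DIR
# Feed every regular file under DIR to dspam as class CLASS.
# find hands the file names to the inner shell as separate arguments,
# so names containing spaces or newlines stay intact.
train_dir() {
    class=$1
    dir=$2
    find "$dir" -type f -exec sh -c '
        dspam=$1; user=$2; class=$3; shift 3
        for msg in "$@"; do
            printf "%s\n" "$msg"
            "$dspam" --client --user "$user" --class="$class" \
                --source=corpus --mode=teft < "$msg"
        done
    ' sh "$DSPAM" "$DSPAM_USER" "$class" {} +
}
```

With this in place, `train_dir spam /home/spam` would replace the first quoted loop. The ham corpus is handled the same way with the class name your dspam version accepts (the thread uses `ham`; check the dspam man page).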
> Hallo Ken,
> 
> 
> > I just took a look at the dspam 3.9RC2 tools.pgsql schema definition
> > and it is curiously lacking an index on token data. I would recommend
> > something with (uid,token). Also adjust the fillfactor to allow HOT
> > updates to the table. Finally a CLUSTER using the uid,token index
> > should help locality of reference.
> > 
> > I am not a strong PostgreSQL user, so I don't know what should be added and 
> > what not. If I am not wrong, then "UNIQUE (uid, token)" already adds an 
> > index to dspam_token_data, so there is no need to add the same index 
> > again. The only thing that one could add/change is the CLUSTER and/or 
> > FILLFACTOR. If I am not wrong, then the FILLFACTOR for btree indices is at 90. 
> > Should that fillfactor be lower or higher for DSPAM? What would you suggest?
> 
> I think that most users will not benefit from a low FILLFACTOR. At least not 
> if we only index uid and token. Most setups will have at the beginning a 
> > huge insert rate for new uid, token combinations but then later the changes 
> to those two values will be fairly low. The only mechanism that is doing 
> changes on them is when new tokens for a uid get inserted and when the purge 
> script is removing old tokens. All other actions modify spam_hits, 
> innocent_hits and last_hit and those changes would not benefit anyway from 
> the default index. So FILLFACTOR has no effect on those updates.
> 
> > On my setup I use a merged group and this is the user getting the most 
> > changes (because I do daily training of that user from my honeypot). But 
> > normal users rarely get new tokens. For me a FILLFACTOR of 90, maybe even 
> > higher, is okay. I don't know what other setups out there look like. Maybe 
> > they would benefit from a lower FILLFACTOR? Should I add a default of 80 for 
> > DSPAM? What do you think?
> --------------
Hi Stevan,


With the schema for dspam_token_data, a fillfactor of 90% will leave
about 25 free tuple slots per page for HOT updates. I think 90% would be
okay. Sorry about missing the UNIQUE(*) statement. That should create the
appropriate index. You might be able to go to a 95% fillfactor, but if
that does not leave enough room for intermediate updates, the new updates
will be added to the end of the table, creating more random I/O. The
CLUSTER is more useful for systems once they have reached a more steady
state, since it will improve the locality of reference for a user's
tokens, which will enhance performance. I think clustering by that index
is good, certainly for larger DSPAM setups. I have not commented sooner
because other priorities have not allowed us to do the needed testing.
Hopefully we can do some testing with RC2.

Regards,
Ken
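
HOT updates reuse free space in the table's own pages, so the setting that matters for them is the table's fillfactor storage parameter. The ALTER INDEX statement quoted in this thread sets the index fillfactor instead. As a rough sketch, the table-level settings discussed here could be written to a separate SQL file like this, assuming PostgreSQL 8.3 or later (HOT and the `CLUSTER ... USING` syntax). The helper name `tuning_sql` and the `FILLFACTOR` variable are hypothetical. The index name is the one used in the thread and may differ on other installations.

```shell
#!/bin/sh
# Names taken from the thread; check them against your own schema.
TABLE=dspam_token_data
INDEX=dspam_token_data_uid_key
# Assumption: 90 follows the value discussed in the thread.
FILLFACTOR=${FILLFACTOR:-90}

# Print the optional tuning statements so they can be reviewed first.
tuning_sql() {
    cat <<EOF
-- Leave free space in each table page for HOT updates.
ALTER TABLE $TABLE SET (fillfactor = $FILLFACTOR);
-- Rewrite the table in index order; this also applies the new fillfactor
-- to existing pages. CLUSTER takes an exclusive lock on the table.
CLUSTER $TABLE USING $INDEX;
-- Refresh planner statistics after the rewrite.
ANALYZE $TABLE;
EOF
}
```

Calling `tuning_sql > dspam_tuning.sql` writes the statements to a file (the file name is only an example) that can be reviewed and then loaded with psql during a maintenance window.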

> ALTER INDEX dspam_token_data_uid_key SET (fillfactor = 80);
> REINDEX INDEX dspam_token_data_uid_key;
> --------------
> 
> Do you suggest to set CLUSTER by default for the index? Should I add the 
> following to the PostgreSQL schema:
> --------------
> ALTER TABLE dspam_token_data CLUSTER ON dspam_token_data_uid_key;
> --------------
> 
> 
> btw: Why so late with suggestions? Would have been nice to get those changes 
> much more in advance. We never heard anything from the DSPAM users using 
> PostgreSQL. And our bug/feature tracker was and is always open for such 
> things.
> 
> btw2: If I am not wrong then CLUSTER is available for PostgreSQL >= 8.0. 
> Right? And since when is the FILLFACTOR available? I ask because maybe we 
> need to add those commands to a separate SQL file for PostgreSQL users having 
> newer versions instead of adding that to the default SQL schema.
> 
> btw3: I do daily purging of tokens and force a reindex of the tables every 
> night after purging. I don't know how others maintain their DSPAM. But in 
> general I think we should not mess around too much with CLUSTER/FILLFACTOR. A 
> good DBA will tweak his/her PostgreSQL instance to suit his/her needs anyway. 
> Maybe I am wrong and we should set some sane defaults for those who use 
> PostgreSQL but do not know much about it?

Yes, good defaults for less experienced users are a good idea.

Ken
