On Wed, 16 Dec 2009 16:53:16 -0600 Kenneth Marshall <[email protected]> wrote:
> On Thu, Dec 17, 2009 at 12:40:27AM +0200, Ibrahim Harrani wrote:
> > Hi,
> >
> > I have 5000 ham and 15 000 spam mails. I would like to train it before using
> > dspam.
> > I usually train spam mail first then ham mail with the following commands.
> > Is it safe to train dspam like this.
> > or do I have to use dspam_train script?
> > If I remember correctly i used dspam_train but the script was learning spam
> > mails as ham because of not enough pattern?
> >
> >
> > for spam:
> > for i in `/usr/bin/find /home/spam/ -type f`
> > do
> > echo $i; sudo /usr/local/bin/dspam --client --user myglobaluser \
> >   --class=spam --source=corpus --mode=teft < $i
> >
> > done
> >
> > For ham mails:
> >
> > for i in `/usr/bin/find /home/ham/ -type f`
> > do
> > echo $i; sudo /usr/local/bin/dspam --client --user myglobaluser \
> >   --class=ham --source=corpus --mode=teft < $i
> >
> > done
> >
> > PS: I started testing dspam 3.9RC2 on FreeBSD with PostgreSQL driver
> > support.
>
Hallo Ken,

> I just took a look at the dspam 3.9RC2 tools.pgsql schema definition
> and it is curiously lacking an index on token data. I would recommend
> something with (uid,token). Also adjust the fillfactor to allow HOT
> updates to the table. Finally a CLUSTER using the uid,token index
> should help locality of reference.
>
I am not a strong PostgreSQL user, so I don't know what should be added
and what should not. If I am not wrong, the "UNIQUE (uid, token)"
constraint already adds an index to dspam_token_data, so there is no
need to add the same index again. The only things one could add or
change are CLUSTER and/or FILLFACTOR.

If I am not wrong, the default FILLFACTOR for btree indices is 90.
Should that fillfactor be lower or higher for DSPAM? What would you
suggest? I think that most users will not benefit from a low
FILLFACTOR, at least not if we only index uid and token.
Most setups will have a huge insert rate for new (uid, token)
combinations at the beginning, but later the changes to those two
values will be fairly low. The only operations that change them are the
insertion of new tokens for a uid and the purge script removing old
tokens. All other actions modify spam_hits, innocent_hits and last_hit,
and those changes would not benefit from the default index anyway. So
the index FILLFACTOR has no effect on those updates.

On my setup I use a merged group, and this is the user getting the most
changes (because I train that user daily from my honeypot). But normal
users rarely have new tokens. For me a FILLFACTOR of 90, maybe even
higher, is okay. I don't know what other setups are out there. Maybe
they would benefit from a lower FILLFACTOR? Should I add a default of 80
for DSPAM? What do you think?

--------------
ALTER INDEX dspam_token_data_uid_key SET (fillfactor = 80);
REINDEX INDEX dspam_token_data_uid_key;
--------------

Do you suggest setting CLUSTER by default for the index? Should I add
the following to the PostgreSQL schema:

--------------
ALTER TABLE dspam_token_data CLUSTER ON dspam_token_data_uid_key;
--------------

btw: Why so late with suggestions? It would have been nice to get those
changes much further in advance. We never heard anything from the DSPAM
users using PostgreSQL, and our bug/feature tracker was and is always
open for such things.

btw2: If I am not wrong, CLUSTER is available for PostgreSQL >= 8.0.
Right? And since when is FILLFACTOR available? I ask because we may need
to put those commands into a separate SQL file for PostgreSQL users with
newer versions instead of adding them to the default SQL schema.

btw3: I purge tokens daily and force a reindex of the tables every night
after purging. I don't know how others maintain their DSPAM. But in
general I think we should not mess around too much with
CLUSTER/FILLFACTOR.
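
To make the separate-file idea from btw2 a bit more concrete, the
optional statements could be printed by a small script depending on the
server version. This is only a rough sketch under assumptions:
tuning_sql is a made-up helper name, the numeric version argument
(80300 for 8.3.0) is just a convention chosen here, and the version
thresholds (ALTER TABLE ... CLUSTER ON from 8.0, the fillfactor storage
parameter from 8.2, HOT updates from 8.3) should be checked against the
PostgreSQL release notes before anyone relies on them. The last
statement is also an assumption about what the HOT remark refers to,
namely the fillfactor of the table itself rather than of the index.

```shell
#!/bin/sh
# Sketch only: print optional PostgreSQL tuning statements for DSPAM.
# tuning_sql is a hypothetical helper. $1 is the server version as a
# number (major*10000 + minor*100 + patch, e.g. 80300 for 8.3.0).
# The version thresholds below are assumptions, verify before use.
tuning_sql() {
    ver=$1
    ff=${2:-80}   # fillfactor; 80 is only the value discussed above

    if [ "$ver" -ge 80000 ]; then
        # Record the (uid, token) index as the clustering index.
        echo "ALTER TABLE dspam_token_data CLUSTER ON dspam_token_data_uid_key;"
    fi
    if [ "$ver" -ge 80200 ]; then
        # Index fillfactor, same statements as quoted above.
        echo "ALTER INDEX dspam_token_data_uid_key SET (fillfactor = $ff);"
        echo "REINDEX INDEX dspam_token_data_uid_key;"
    fi
    if [ "$ver" -ge 80300 ]; then
        # Table (heap) fillfactor: leaves free space in each page, which
        # updates of spam_hits/innocent_hits/last_hit could reuse.
        echo "ALTER TABLE dspam_token_data SET (fillfactor = $ff);"
    fi
}
```

Something like "tuning_sql 80300 90 > tuning.sql" would then produce a
file that can be fed to psql. Note that CLUSTER ON only records which
index to use; the table is reordered only when CLUSTER is actually run.
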

A good DBA will tweak his/her PostgreSQL instance to suit his/her needs
anyway. Maybe I am wrong and we should set some sane defaults for those
who use PostgreSQL without knowing much about it?

> Regards,
> Ken
>

-- 
Kind Regards from Switzerland,

Stevan Bajić
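
PS, on the quoted training loops: the backtick form splits file names on
whitespace, so a mail file with a space in its name would be fed to
dspam in pieces. A variant that avoids this might look like the sketch
below. train_dir and the DSPAM variable are names invented for this
example, the dspam options are exactly the ones from the quoted
commands, sudo is left out, and it has not been tested against 3.9RC2.

```shell
#!/bin/sh
# Sketch: feed every file below a directory to dspam as corpus training.
# train_dir and DSPAM are hypothetical names; the dspam options are the
# ones from the quoted commands.
DSPAM=${DSPAM:-/usr/local/bin/dspam}

train_dir() {
    class=$1   # "spam" or "ham"
    dir=$2
    # find passes each path to sh as a single argument, so names with
    # spaces stay intact (the backtick loop would split them).
    find "$dir" -type f -exec sh -c '
        echo "$3"
        "$1" --client --user myglobaluser --class="$2" \
             --source=corpus --mode=teft < "$3"
    ' sh "$DSPAM" "$class" {} \;
}

# Example calls, using the directories from the quoted mail:
# train_dir spam /home/spam
# train_dir ham  /home/ham
```
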
