On Thu, Dec 17, 2009 at 01:53:05AM +0100, Stevan Bajić wrote:
> On Wed, 16 Dec 2009 16:53:16 -0600
> Kenneth Marshall <[email protected]> wrote:
>
> > On Thu, Dec 17, 2009 at 12:40:27AM +0200, Ibrahim Harrani wrote:
> > > Hi,
> > >
> > > I have 5000 ham and 15 000 spam mails. I would like to train dspam on
> > > them before using it.
> > > I usually train on spam mail first, then ham mail, with the following
> > > commands. Is it safe to train dspam like this, or do I have to use the
> > > dspam_train script? If I remember correctly, I used dspam_train, but
> > > the script was learning spam mails as ham because of not enough
> > > patterns?
> > >
> > > For spam:
> > >
> > > for i in `/usr/bin/find /home/spam/ -type f`
> > > do
> > >     echo $i; sudo /usr/local/bin/dspam --client --user myglobaluser \
> > >         --class=spam --source=corpus --mode=teft < $i
> > > done
> > >
> > > For ham mails:
> > >
> > > for i in `/usr/bin/find /home/ham/ -type f`
> > > do
> > >     echo $i; sudo /usr/local/bin/dspam --client --user myglobaluser \
> > >         --class=ham --source=corpus --mode=teft < $i
> > > done
> > >
> > > PS: I started testing dspam 3.9RC2 on FreeBSD with PostgreSQL driver
> > > support.
> >
> Hello Ken,
>
> > I just took a look at the dspam 3.9RC2 tools.pgsql schema definition
> > and it is curiously lacking an index on token data. I would recommend
> > something with (uid,token). Also adjust the fillfactor to allow HOT
> > updates to the table. Finally a CLUSTER using the uid,token index
> > should help locality of reference.
> >
> I am not a strong PostgreSQL user, so I don't know what should be added and
> what not. If I am not wrong then this here "UNIQUE (uid, token)" is already
> adding an index to dspam_token_data. So there is no need to add the same
> index again. The only thing that one could add/change is the CLUSTER and/or
> FILLFACTOR. If I am not wrong then the FILLFACTOR for btree indices is at 90.
> Should that fillfactor be lower or higher for DSPAM?
> What would you suggest?
>
> I think that most users will not benefit from a low FILLFACTOR. At least not
> if we only index uid and token. Most setups will have a huge insert rate for
> new uid, token combinations at the beginning, but later the changes to those
> two values will be fairly low. The only mechanisms that change them are the
> insertion of new tokens for a uid and the purge script removing old tokens.
> All other actions modify spam_hits, innocent_hits and last_hit and those
> changes would not benefit anyway from the default index. So FILLFACTOR has
> no effect on those updates.
>
> On my setup I use a merged group and this is the user getting the most
> changes (because I do daily training of that user from my honeypot). But
> normal users rarely have new tokens. For me a FILLFACTOR of 90, maybe even
> higher, is okay. I don't know what other setups are out there. Maybe they
> would benefit from a lower FILLFACTOR? Should I add a default of 80 for
> DSPAM? What do you think?
> --------------

Hi Stevan,

With the schema for dspam_token_data, a fillfactor of 90% will leave about
25 free tuple slots per page for HOT updates. I think 90% would be okay.
Sorry about missing the UNIQUE(*) statement. That should create the
appropriate index. You might be able to go to a 95% fillfactor, but if that
does not leave enough room for intermediate updates, the new updates will be
added to the end of the table, creating more random I/O.

The CLUSTER is more useful for systems once they have reached a more steady
state, since it will improve the locality of reference for a user's tokens,
which will enhance performance. I think clustering by that index is good,
certainly for larger DSPAM setups.

I have not commented sooner because other priorities have not allowed us to
do the needed testing. Hopefully we can do some testing with RC2.

Regards,
Ken

> ALTER INDEX dspam_token_data_uid_key SET (fillfactor = 80);
> REINDEX INDEX dspam_token_data_uid_key;
> --------------
>
> Do you suggest to set CLUSTER by default for the index? Should I add the
> following to the PostgreSQL schema:
> --------------
> ALTER TABLE dspam_token_data CLUSTER ON dspam_token_data_uid_key;
> --------------
>
>
> btw: Why so late with suggestions? It would have been nice to get those
> changes much more in advance. We never heard anything from the DSPAM users
> using PostgreSQL. And our bug/feature tracker was and is always open for
> such things.
>
> btw2: If I am not wrong then CLUSTER is available for PostgreSQL >= 8.0.
> Right? And since when is the FILLFACTOR available? I ask because maybe we
> need to add those commands to a separate SQL file for PostgreSQL users having
> newer versions instead of adding that to the default SQL schema.
>
> btw3: I do daily purging of tokens and force a reindex of the tables every
> night after purging. I don't know how others maintain their DSPAM. But in
> general I think we should not mess around too much with CLUSTER/FILLFACTOR.
> A good DBA will anyway tweak his/her PostgreSQL instance to suit his/her
> needs. Maybe I am wrong and we should set some sane defaults for those using
> PostgreSQL but not knowing much about PostgreSQL?

Yes, good defaults for less experienced users are a good idea.

Ken

_______________________________________________
Dspam-user mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dspam-user
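For anyone who wants to try the settings discussed in this thread, here is a
minimal shell sketch that prints them as a separate SQL file, as btw2
suggests, instead of folding them into the default schema. It is not part of
DSPAM: the function name emit_pgsql_tuning and its argument are made up for
illustration, and the index name is simply the one quoted above. One caveat,
as far as I understand PostgreSQL: the free space that HOT updates use is
governed by the table's fillfactor (a storage parameter available since 8.2,
with HOT arriving in 8.3), whereas ALTER INDEX ... SET (fillfactor) only
affects the btree pages. The sketch therefore sets the table value.

```shell
#!/bin/sh
# Sketch only: print optional tuning statements for dspam_token_data.
# emit_pgsql_tuning is a hypothetical helper, not shipped with DSPAM.
# $1 is the table fillfactor in percent (PostgreSQL accepts 10 to 100).
emit_pgsql_tuning() {
    ff="${1:-90}"   # 90 is the value Ken considers okay in this thread

    # Reject anything that is not a plain integer in the valid range.
    case "$ff" in
        ''|*[!0-9]*) echo "fillfactor must be an integer" >&2; return 1 ;;
    esac
    if [ "$ff" -lt 10 ] || [ "$ff" -gt 100 ]; then
        echo "fillfactor must be between 10 and 100" >&2
        return 1
    fi

    cat <<EOF
-- Leave free space in each heap page so updates of spam_hits,
-- innocent_hits and last_hit can stay on the same page (HOT).
ALTER TABLE dspam_token_data SET (fillfactor = $ff);
-- Remember the UNIQUE (uid, token) index as the clustering index.
ALTER TABLE dspam_token_data CLUSTER ON dspam_token_data_uid_key;
-- Rewrite the table in index order; this takes an exclusive lock.
CLUSTER dspam_token_data;
ANALYZE dspam_token_data;
EOF
}
```

Something like "emit_pgsql_tuning 90 > pgsql_tuning.sql" (the file name is
arbitrary) would produce a file that can be reviewed and then fed to psql
during a quiet period. Changing the fillfactor by itself does not repack
existing pages, which is why the sketch follows it with a CLUSTER.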

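The training loops quoted at the top of the thread split the output of find
on whitespace, so a message file with a space in its name would be handed to
the shell in pieces. Below is a minimal sketch of the same loop that reads one
path per line instead. The function name train_corpus and the DSPAM_BIN and
DSPAM_USER variables are illustrative assumptions; the dspam options are
exactly the ones from the original commands, and sudo is left out so the
caller decides how to obtain privileges.

```shell
#!/bin/sh
# Sketch only: feed every file under a directory to dspam as corpus mail.
# DSPAM_BIN and DSPAM_USER are hypothetical override variables; the
# defaults are the path and user name quoted in the thread.
DSPAM_BIN="${DSPAM_BIN:-/usr/local/bin/dspam}"
DSPAM_USER="${DSPAM_USER:-myglobaluser}"

train_corpus() {
    class="$1"   # class name passed through to --class
    dir="$2"     # directory holding one message per file

    # One path per line; IFS= and -r keep spaces and backslashes intact.
    # (Paths containing newlines are not handled by this sketch.)
    find "$dir" -type f | while IFS= read -r msg; do
        echo "$msg"
        "$DSPAM_BIN" --client --user "$DSPAM_USER" \
            --class="$class" --source=corpus --mode=teft < "$msg"
    done
}
```

Usage would be along the lines of "train_corpus spam /home/spam", followed by
a second call for the ham directory. The thread passes "ham" as the class for
non-spam mail; it is worth checking the dspam man page for the class names
your build accepts before running the second pass.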