On Sun, 14 Feb 2010 11:49:20 +0000 Kārlis Repsons <[email protected]> wrote:
> People,

Hallo Kārlis,

> I know it depends on quite many factors in total, but anyway, could we make a
> small list of values and info in here like this:

What do you mean? Should we all submit our values here? Or do you mean something else? I'll start with the values I use:

> 1. storage type,

MySQL

> 2. total size,

Of the storage used? This depends. My current setup uses 334.8 MiB of InnoDB data. I have clustered my DSPAM installation. Right now my main installation runs MySQL 5.1.43 in Master/Master mode. I do, however, have other installations that use PostgreSQL and other databases.

> 3. number of people, who contribute to create dspam data,

On my setup: a few hundred domains.

> 4. number of typically used languages,

In my setup: German (main language), French, Italian, English, Russian and some other Slavic languages.

> 5. how purges happen,

Daily at 4am with the dspam_maintenance.sh script -> http://dspam.git.sourceforge.net/git/gitweb.cgi?p=dspam/dspam;a=tree;f=contrib/dspam_maintenance;hb=HEAD

> 6. normally used dspam mode,

* Algorithm graham burton
* PValue bcr
* Using DSPAM as a content filter in Postfix.
* DSPAM runs in server/client mode (using LMTP).
* Using a global merged group (the merged group currently has 1'616'468 tokens)
* Stats of the merged group:
    TP  True Positives:             0
    TN  True Negatives:             0
    FP  False Positives:            1
    FN  False Negatives:            0
    SC  Spam Corpusfed:             129498
    NC  Nonspam Corpusfed:          62035
    TL  Training Left:              0
    SHR Spam Hit Rate:              100.00%
    HSR Ham Strike Rate:            100.00%
    PPV Positive predictive value:  0.00%
    OCA Overall Accuracy:           0.00%
* The data in the merged group comes from roughly 5 million spam mails and 3 million ham mails.
* Training of the merged group is done with a honeypot that captures spam, by feeding some outbound mail as ham, and by processing other sources of ham (newsgroups, etc.). The training data is normalized using "boosting" techniques. Only messages that fall within a certain threshold are trained, and I check all of them myself before DSPAM is allowed to train on them. The training corpora do not contain newsletters and such things (I don't train newsletters).
* Training is done with a custom-made script that uses the TONE (train on error or near error) technique, uses an asymmetric thickness for ham/spam, and does double-sided training (a technique used to boost accuracy).
* DSPAM home is shared on a GlusterFS storage that has AFR/Replicate.

> 7. Tokenizer?

OSB

> Well, maybe its not that hard... I yesterday understood, that I actually need
> a different machine to use dspam for myself (well, that is specific, I store
> things in RAM actually). So it would be nice to see some infos and get a
> picture of what should I count on.

_______________________________________________
Dspam-user mailing list [email protected] https://lists.sourceforge.net/lists/listinfo/dspam-user
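
For readers who have not met TONE before: the training item under point 6 can be illustrated with a minimal sketch of the train-or-skip decision. This is not the custom script mentioned in the post, and it does not call DSPAM. The names ToneConfig and needs_training, the 0.5 decision threshold, and both thickness values are assumptions made up for illustration. The idea is that a message is trained when it was misclassified (error) or when its score landed within a margin of the decision boundary (near error), with a separate margin for ham and for spam (the asymmetric thickness). Checking both classes against their margins is one possible reading of "double-sided".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToneConfig:
    """Hypothetical settings; none of these values come from the post."""
    threshold: float = 0.5       # assumed boundary on the spam probability
    spam_thickness: float = 0.3  # margin required above the boundary for spam
    ham_thickness: float = 0.1   # margin required below the boundary for ham


def needs_training(spam_probability: float, is_spam: bool,
                   cfg: ToneConfig = ToneConfig()) -> bool:
    """Return True if a message with a known label should be trained.

    spam_probability is the classifier's score in [0, 1]; is_spam is the
    verified label of the message.
    """
    if is_spam:
        # Error: scored below the boundary.
        # Near error: scored as spam, but not by at least spam_thickness.
        return spam_probability < cfg.threshold + cfg.spam_thickness
    # Error: scored above the boundary.
    # Near error: scored as ham, but not by at least ham_thickness.
    return spam_probability > cfg.threshold - cfg.ham_thickness


if __name__ == "__main__":
    samples = [(0.95, True), (0.70, True), (0.20, True),
               (0.05, False), (0.45, False), (0.90, False)]
    for score, label in samples:
        kind = "spam" if label else "ham"
        action = "train" if needs_training(score, label) else "skip"
        print(f"{kind:4s} score={score:.2f} -> {action}")
```

With these assumed margins, a spam message scored 0.95 is skipped because it is already confidently classified, while a spam message scored 0.70 is still trained even though it was classified correctly. The wider spam margin is only an example of asymmetry; which side should be thicker depends on the corpus.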
