Re: [Dspam-user] Poll about database sizes

Stevan Bajić Sun, 14 Feb 2010 04:56:59 -0800

On Sun, 14 Feb 2010 11:49:20 +0000
Kārlis Repsons <[email protected]> wrote:


> People,
>
Hello Kārlis,


> I know it depends on quite many factors in total, but anyway, could we make a 
> small list of values and info in here like this:
> 
What do you mean? Should we all submit our values here? Or do you mean 
something else?


I'll start with the values I use:

> 1. storage type,
>
MySQL


> 2. total size,
>
Of the storage used? That depends. My current setup uses 334.8 MiB of InnoDB 
data. I have clustered my DSPAM installation. Right now my main installation 
runs MySQL 5.1.43 in Master/Master mode. I do, however, have other 
installations that use PostgreSQL and other databases.


> 3. number of people, who contribute to create dspam data,
>
On my setup: a few hundred domains.


> 4. number of typically used languages,
>
In my setup: German (main language), French, Italian, English, Russian and some 
other Slavic languages.


> 5. how purges happen,
>
Daily at 4am with the dspam_maintenance.sh script -> 
http://dspam.git.sourceforge.net/git/gitweb.cgi?p=dspam/dspam;a=tree;f=contrib/dspam_maintenance;hb=HEAD


> 6. normally used dspam mode,
>
* Algorithm graham burton
* PValue bcr
* Using DSPAM as a content filter in Postfix.
* DSPAM runs in server/client mode (using LMTP).
* Using a global merged group (the merged group currently has 1'616'468 tokens)
* Stats of the merged group:
   TP True Positives:                     0
   TN True Negatives:                     0
   FP False Positives:                    1
   FN False Negatives:                    0
   SC Spam Corpusfed:                129498
   NC Nonspam Corpusfed:              62035
   TL Training Left:                      0
   SHR Spam Hit Rate                100.00%
   HSR Ham Strike Rate:             100.00%
   PPV Positive predictive value:     0.00%
   OCA Overall Accuracy:              0.00%
* The data in the merged group comes from roughly 5 million spam mails and 
3 million ham mails.
* Training of the merged group is done with a honeypot that captures spam, by 
feeding some outbound mail as ham, and by processing other sources for ham 
(news groups, etc.). Training data is normalized using "boosting" techniques; 
only messages within a certain threshold are trained, and I check all of them 
myself before DSPAM is allowed to train them. The training corpora do not 
contain any newsletters and such things (I don't train newsletters).
* Training is done using a custom-made script that uses the TONE (train on error 
or near error) technique, uses an asymmetric thickness for ham/spam, and does 
double-sided training (a technique used to boost accuracy).
* DSPAM home is shared on a GlusterFS storage that has AFR/Replicate.
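
To illustrate the TONE idea with an asymmetric thickness mentioned above, here 
is a minimal Python sketch. It is not the custom training script itself. The 
function name, the 0.5 decision threshold and both thickness values are 
assumptions chosen only for illustration, and double-sided training is not 
covered here.

```python
def needs_training(score, is_spam, threshold=0.5,
                   spam_thickness=0.2, ham_thickness=0.1):
    """Decide whether a message should be trained under TONE.

    score          -- spam probability in [0, 1] reported by the filter
    is_spam        -- the known, verified class of the message
    threshold      -- decision boundary (assumed value)
    spam_thickness -- margin on the spam side of the boundary (assumed value)
    ham_thickness  -- margin on the ham side of the boundary (assumed value)
    """
    classified_as_spam = score >= threshold

    # Train on error: the filter got the message wrong.
    if classified_as_spam != is_spam:
        return True

    # Train on near error: the result was right, but too close to the
    # boundary. The margin differs per class, hence "asymmetric".
    if is_spam:
        return (score - threshold) < spam_thickness
    return (threshold - score) < ham_thickness


if __name__ == "__main__":
    samples = [(0.9, True), (0.6, True), (0.3, True),
               (0.1, False), (0.45, False), (0.7, False)]
    for score, is_spam in samples:
        print(score, is_spam, needs_training(score, is_spam))
```

With these assumed values, a spam message scored 0.9 is left alone, while one 
scored 0.6 is trained again because it sits inside the spam-side margin.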


> 7. Tokenizer?
> 
OSB
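
For readers unfamiliar with OSB (orthogonal sparse bigrams): each word is paired 
with each of the following words inside a small sliding window, and the 
distance between the two is kept as part of the token. The Python sketch below 
only illustrates that idea and is not DSPAM's tokenizer code. The function 
name, the window size of 5, and the "+" and "#" markers are assumptions.

```python
def osb_tokens(words, window=5):
    """Build sparse bigram tokens from a list of words.

    Every word is combined with each of the next (window - 1) words.
    Skipped positions are marked with "#" so that pairs at different
    distances produce different tokens.
    """
    tokens = []
    for i, first in enumerate(words):
        for gap in range(1, window):
            if i + gap >= len(words):
                break  # ran past the end of the message
            placeholders = ["#"] * (gap - 1)
            tokens.append("+".join([first, *placeholders, words[i + gap]]))
    return tokens


if __name__ == "__main__":
    for token in osb_tokens("buy cheap pills online now".split()):
        print(token)
```

Each word yields up to four tokens with this window, which is one reason 
token counts grow quickly with this kind of tokenizer.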


> Well, maybe its not that hard... I yesterday understood, that I actually need 
> a different machine to use dspam for myself (well, that is specific, I store 
> things in RAM actually). So it would be nice to see some infos and get a 
> picture of what should I count on.

_______________________________________________
Dspam-user mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dspam-user
