Re: [dspam-users] dspam accuracy

Kyle Johnson Mon, 05 Mar 2007 20:19:06 -0800

Tony Earnshaw wrote:

Shane Kumpf wrote, on 03. mar 2007 19:06:
One of my POP accounts, which I pull down via fetchmail and which isn’t controlled by me, just changed their spam filtering software. In the process of this change they went from bouncing a lot of suspected spam to just quarantining it. Since I don’t use their quarantine, obviously I’m ending up with quite a bit more spam than I used to.
FWIW I'm in more or less the same position as you, doing more or less the same. I'm responsible for a high school email system (1150+ nominal users, around 350 active mailers) running dspam as a daemon with remarkable accuracy. I had my private mail address on the school's server.
For my home machine (Red Hat RHAS4), last November/December I decided to take my private mail off the school's server and activate my ISP's POP account. The ISP has a spam and virus filtering service, but I don't trust it; I trust myself.
I'm running more or less the same basic system that I have at school, but I use Fetchmail 6.3.6. I have a Postfix 2.3.6 MTA calling amavisd-new 2.4.5 with ClamAV 0.90.1 and BitDefender-Console-Antivirus 7.3-1. A Postfix smtpd listener passes the mail to dspam CVS/MySQL 4.1.20, which scans it and passes it back to Postfix, which gives it to maildrop for IMAP distribution.
Unfortunately I'm not getting much spam. I do what I can to aggravate things to get more, like posting on newsgroups (which used to work well) with a throwaway address, but that mostly gets me a few "virus" (also phishing stuff caught by ClamAV).
Dspam is having a lot of trouble classifying these new messages as spam. The strange thing is that all this spam looks very similar to the spam it is catching. I’m starting to wonder if I should wipe my databases and start fresh; it’s been about a month and it doesn’t seem to be getting any better. I’m getting roughly 200 pieces of spam a day now. My stats have dropped considerably, from about 90% accuracy to less than 70%, as you will see. Do you think that if I continue to train it will get better, or that due to the size and age of my database this new spam will have trouble getting classified? If there is any info I can provide, let me know.
I decided to start with a completely empty dspam db and see what happened, and I must say I'm pleased with the result up to now. dspam is learning relatively fast and beginning to judge sensibly, even to the extent that it's interpolating correctly (e.g. if it's had spam in Greek or French it recognizes spam in Spanish but leaves the local Dutch stuff alone - I haven't had any Dutch spam to date, though).
                TP True Positives:          17731
                TN True Negatives:          21733
                FP False Positives:         10937
                FN False Negatives:          6361
                SC Spam Corpusfed:           3741
                NC Nonspam Corpusfed:           1
                TL Training Left:               0
                SHR Spam Hit Rate          73.60%
                HSR Ham Strike Rate:       33.48%
                OCA Overall Accuracy:      69.53%
Mine started all askew but as of now it's:


                TP True Positives:             45
                TN True Negatives:           5940
                FP False Positives:             0
                FN False Negatives:            53
                SC Spam Corpusfed:              1
                NC Nonspam Corpusfed:           0
                TL Training Left:               0
                SHR Spam Hit Rate          45.92%
                HSR Ham Strike Rate:        0.00%
                OCA Overall Accuracy:      99.12%

At school it's:
                TP True Positives:          12914
                TN True Negatives:          87136
                FP False Positives:           384
                FN False Negatives:           344
                SC Spam Corpusfed:           3311
                NC Nonspam Corpusfed:        3002
                TL Training Left:               0
                SHR Spam Hit Rate          97.41%
                HSR Ham Strike Rate:        0.44%
                OCA Overall Accuracy:      99.28%
So not a wild difference between corpus feeding or not. The school gets most correspondence in Dutch, and to begin with (starting last October) dspam thought all Dutch stuff was spam (all the corpus was English) and got mixed up, but it's mostly judging well now.
I'm using a shared group for both sites and my home dspam.conf looks like:
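
For anyone comparing their own numbers, the three percentage lines in the stats blocks above can be recomputed from the four raw counts. The following is a minimal Python sketch; the function name `dspam_rates` is made up for illustration, and the formulas are an assumption inferred from the figures quoted in this thread (they reproduce all three blocks to two decimals), not taken from the dspam source.

```python
def dspam_rates(tp, tn, fp, fn):
    """Return (SHR, HSR, OCA) as percentages from raw classification counts.

    Assumed formulas, inferred from the stats quoted in this thread:
      SHR (spam hit rate)    = TP / (TP + FN)
      HSR (ham strike rate)  = FP / (TN + FP)
      OCA (overall accuracy) = (TP + TN) / (TP + TN + FP + FN)
    """
    spam_total = tp + fn      # everything that really was spam
    ham_total = tn + fp       # everything that really was ham
    total = spam_total + ham_total
    shr = 100.0 * tp / spam_total if spam_total else 0.0
    hsr = 100.0 * fp / ham_total if ham_total else 0.0
    oca = 100.0 * (tp + tn) / total if total else 0.0
    return shr, hsr, oca


# Counts copied from the three stats blocks in this thread (TP, TN, FP, FN).
samples = {
    "Shane":  (17731, 21733, 10937, 6361),
    "home":   (45, 5940, 0, 53),
    "school": (12914, 87136, 384, 344),
}

for label, counts in samples.items():
    shr, hsr, oca = dspam_rates(*counts)
    print(f"{label}: SHR {shr:.2f}%  HSR {hsr:.2f}%  OCA {oca:.2f}%")
```

This also shows why the home figures look odd at first glance: with only 98 spam messages against 5940 ham, overall accuracy stays above 99% even though fewer than half of the spam messages were caught.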
Home /var/dspam
DeliveryHost        127.0.0.1
DeliveryPort        10026
DeliveryIdent       dspam-out
DeliveryProto       SMTP
FallbackDomains on
OnFail error
Trust root
Trust nobody
Debug *
DebugOpt process spam fp innocent
TrainingMode toe
TestConditionalTraining on
Feature tb=3
Feature whitelist
Feature noise
Algorithm graham burton
PValue graham
SupressWebStats on
ImprobabilityDrive on
Preference "signatureLocation=headers"  # 'message' or 'headers'
AllowOverride trainingMode
AllowOverride spamAction spamSubject
AllowOverride statisticalSedation
AllowOverride enableBNR
AllowOverride enableWhitelist
AllowOverride showFactors
AllowOverride optIn optOut
AllowOverride whitelistThreshold
AllowOverride makeCorpus
AllowOverride fallbackDomain
AllowOverride trainingMode
MySQLServer     /var/lib/mysql/mysql.sock
MySQLUser               dspam
MySQLPass               dspam
MySQLDb                 dspamdb
MySQLConnectionCache    10
IgnoreHeader DomainKey-Signature
IgnoreHeader X-DKIM
IgnoreHeader X-Virus-Scanned
IgnoreHeader Delivered-To
IgnoreHeader In-Reply-To
IgnoreHeader X-OriginalArrivalTime
IgnoreHeader X-Disclaimer
IgnoreHeader X-Mailman-Approved-At
IgnoreHeader Archive
IgnoreHeader List-Post
IgnoreHeader List-Subscribe
IgnoreHeader List-Unsubscribe
IgnoreHeader List-Help
IgnoreHeader List-Id
IgnoreHeader Message-ID
Notifications   on
PurgeSignatures 21          # Stale signatures
PurgeNeutral    90          # Tokens with neutralish probabilities
PurgeUnused     90          # Unused tokens
PurgeHapaxes    30          # Tokens with less than 5 hits (hapaxes)
PurgeHits1S     15          # Tokens with only 1 spam hit
PurgeHits1I     15          # Tokens with only 1 innocent hit
LocalMX 127.0.0.1 192.168.0.3 213.75.3.22 213.10.163.78
SystemLog on
UserLog   on
Opt out
TrackSources spam
Broken lineStripping
MaxMessageSize 1024000
ServerHost              127.0.0.1
ServerPort              24
ServerQueueSize 32
ServerPID               /var/run/dspam.pid
ServerMode standard
ServerParameters       "--deliver=innocent,spam -d %u"
ServerIdent            "dspam-in"
ProcessorBias on
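
A listing this long is easy to get wrong by hand, so here is a rough Python sketch for spotting entries that appear more than once (in the listing above, `AllowOverride trainingMode` is given twice). It assumes only the layout visible in the listing: one directive per line, whitespace before the value, and `#` starting a comment. The helper names `parse_dspam_conf` and `repeated_entries` are hypothetical, and this is not how dspam itself reads the file.

```python
from collections import Counter


def parse_dspam_conf(text):
    """Split 'Directive value' lines into (directive, value) pairs.

    Naive on purpose: everything after the first '#' is treated as a
    comment, so a value that itself contains '#' would be cut short.
    """
    entries = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue  # blank line or comment-only line
        parts = line.split(None, 1)
        directive = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        entries.append((directive, value))
    return entries


def repeated_entries(entries):
    """Return the (directive, value) pairs that occur more than once."""
    counts = Counter(entries)
    return [entry for entry, n in counts.items() if n > 1]


# A few lines taken from the listing above.
sample = """\
TrainingMode toe
AllowOverride trainingMode
AllowOverride spamAction spamSubject
PurgeSignatures 21          # Stale signatures
AllowOverride trainingMode
"""

entries = parse_dspam_conf(sample)
for directive, value in repeated_entries(entries):
    print(f"repeated: {directive} {value}")
```

Directives such as `AllowOverride` and `IgnoreHeader` are meant to be given several times with different values, which is why the check compares the directive and its value together rather than the directive name alone.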

Best,

--Tonni

(  Holy cow!  I have never had accuracy that high. :(  )
