Tony Earnshaw wrote:
Shane Kumpf wrote, on 03. mar 2007 19:06:
One of my POP accounts, which I pull down via fetchmail and which isn’t
controlled by me, just changed their spam filtering software. In the
process of this change they went from bouncing a lot of suspected
spam to just quarantining it. Since I don’t use their
quarantine, obviously I’m ending up with quite a bit more spam than I
used to.
FWIW I'm in more or less the same position as you, doing more or less
the same. I'm responsible for a high school email system (1150+
nominal users, around 350 active mailers) running dspam as a daemon
with remarkable accuracy. I had my private mail address on the
school's server.
For my home machine (Red Hat RHAS4) in November/December last I
decided to take my private mail off the school's server and activate
my ISP's POP account. The ISP has a spam and virus filtering service
but I don't trust it; I trust myself.
I'm running more or less the same basic system that I have at school,
but I use Fetchmail 6.3.6. I have a Postfix 2.3.6 MTA calling
amavisd-new 2.4.5 with ClamAV 0.90.1 and BitDefender-Console-Antivirus
7.3-1. A Postfix smtpd listener passes the mail to dspam CVS/MySQL
4.1.20 which scans it and passes back to Postfix, which gives it to
maildrop for IMAP distribution.
Unfortunately I'm not getting much spam. I do what I can to aggravate
things to get more, like posting on newsgroups (which used to work
well) with a throwaway address, but that mostly gets me a few "virus"
(also phishing stuff caught by ClamAV).
Dspam is having a lot of trouble classifying these new messages as
spam. The strange thing is that all this spam looks very similar to
the spam it is catching. I’m starting to wonder if I should wipe my
databases and start fresh; it’s been about a month and it doesn’t
seem to be getting any better. I’m getting roughly 200 pieces of
spam a day now. My stats have dropped considerably, from about 90%
accuracy to less than 70%, as you will see. Do you think that if I
continue to train it will get better, or that due to the size and age
of my database this new spam will have trouble getting classified?
If there is any info I can provide, let me know.
I decided to start with a completely empty dspam db and see what
happened and I must say I'm pleased with the result up to now, dspam
is learning relatively fast and beginning to judge sensibly, even to
the extent that it's interpolating correctly (e.g. if it's had spam in
Greek or French it recognizes spam in Spanish but leaves the local
Dutch stuff alone - I haven't had any Dutch spam to date, though).
TP True Positives: 17731
TN True Negatives: 21733
FP False Positives: 10937
FN False Negatives: 6361
SC Spam Corpusfed: 3741
NC Nonspam Corpusfed: 1
TL Training Left: 0
SHR Spam Hit Rate 73.60%
HSR Ham Strike Rate: 33.48%
OCA Overall Accuracy: 69.53%
Mine started all askew but as of now it's:
TP True Positives: 45
TN True Negatives: 5940
FP False Positives: 0
FN False Negatives: 53
SC Spam Corpusfed: 1
NC Nonspam Corpusfed: 0
TL Training Left: 0
SHR Spam Hit Rate 45.92%
HSR Ham Strike Rate: 0.00%
OCA Overall Accuracy: 99.12%
At school it's:
TP True Positives: 12914
TN True Negatives: 87136
FP False Positives: 384
FN False Negatives: 344
SC Spam Corpusfed: 3311
NC Nonspam Corpusfed: 3002
TL Training Left: 0
SHR Spam Hit Rate 97.41%
HSR Ham Strike Rate: 0.44%
OCA Overall Accuracy: 99.28%
So there's not a wild difference between corpus feeding and not. The school
gets most correspondence in Dutch and to begin with (starting October
last) dspam thought all Dutch stuff was spam (all the corpus was
English) and got mixed up, but it's mostly judging well now.
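
For anyone who wants to check the percentages in the three stats blocks: they appear to follow directly from the four counts. The sketch below assumes the usual definitions (spam hit rate = TP / (TP + FN), ham strike rate = FP / (FP + TN), overall accuracy = (TP + TN) / all messages); these are not taken from the dspam source, but they do reproduce the figures quoted above. The function name `dspam_rates` is made up for illustration.

```python
def dspam_rates(tp, tn, fp, fn):
    """Return (spam hit rate, ham strike rate, overall accuracy) in percent.

    Assumed definitions, not taken from the dspam source:
      SHR = TP / (TP + FN)   share of spam that was caught
      HSR = FP / (FP + TN)   share of ham wrongly flagged as spam
      OCA = (TP + TN) / all  share of all messages judged correctly
    """
    shr = 100.0 * tp / (tp + fn) if (tp + fn) else 0.0
    hsr = 100.0 * fp / (fp + tn) if (fp + tn) else 0.0
    total = tp + tn + fp + fn
    oca = 100.0 * (tp + tn) / total if total else 0.0
    return round(shr, 2), round(hsr, 2), round(oca, 2)


# Counts copied from the three stats blocks in this thread: (TP, TN, FP, FN)
counts = {
    "Shane": (17731, 21733, 10937, 6361),
    "Tony, home": (45, 5940, 0, 53),
    "Tony, school": (12914, 87136, 384, 344),
}

for who, c in counts.items():
    shr, hsr, oca = dspam_rates(*c)
    print(f"{who}: SHR {shr:.2f}%  HSR {hsr:.2f}%  OCA {oca:.2f}%")
```

Note how the home figures combine a 99.12% overall accuracy with a spam hit rate of only 45.92%: with 5940 innocent messages against 98 spam, the overall number is dominated by ham.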
I'm using a shared group for both sites and my home dspam.conf looks
like:
Home /var/dspam
DeliveryHost 127.0.0.1
DeliveryPort 10026
DeliveryIdent dspam-out
DeliveryProto SMTP
FallbackDomains on
OnFail error
Trust root
Trust nobody
Debug *
DebugOpt process spam fp innocent
TrainingMode toe
TestConditionalTraining on
Feature tb=3
Feature whitelist
Feature noise
Algorithm graham burton
PValue graham
SupressWebStats on
ImprobabilityDrive on
Preference "signatureLocation=headers" # 'message' or 'headers'
AllowOverride trainingMode
AllowOverride spamAction spamSubject
AllowOverride statisticalSedation
AllowOverride enableBNR
AllowOverride enableWhitelist
AllowOverride showFactors
AllowOverride optIn optOut
AllowOverride whitelistThreshold
AllowOverride makeCorpus
AllowOverride fallbackDomain
AllowOverride trainingMode
MySQLServer /var/lib/mysql/mysql.sock
MySQLUser dspam
MySQLPass dspam
MySQLDb dspamdb
MySQLConnectionCache 10
IgnoreHeader DomainKey-Signature
IgnoreHeader X-DKIM
IgnoreHeader X-Virus-Scanned
IgnoreHeader Delivered-To
IgnoreHeader In-Reply-To
IgnoreHeader X-OriginalArrivalTime
IgnoreHeader X-Disclaimer
IgnoreHeader X-Mailman-Approved-At
IgnoreHeader Archive
IgnoreHeader List-Post
IgnoreHeader List-Subscribe
IgnoreHeader List-Unsubscribe
IgnoreHeader List-Help
IgnoreHeader List-Id
IgnoreHeader Message-ID
Notifications on
PurgeSignatures 21 # Stale signatures
PurgeNeutral 90 # Tokens with neutralish probabilities
PurgeUnused 90 # Unused tokens
PurgeHapaxes 30 # Tokens with less than 5 hits (hapaxes)
PurgeHits1S 15 # Tokens with only 1 spam hit
PurgeHits1I 15 # Tokens with only 1 innocent hit
LocalMX 127.0.0.1 192.168.0.3 213.75.3.22 213.10.163.78
SystemLog on
UserLog on
Opt out
TrackSources spam
Broken lineStripping
MaxMessageSize 1024000
ServerHost 127.0.0.1
ServerPort 24
ServerQueueSize 32
ServerPID /var/run/dspam.pid
ServerMode standard
ServerParameters "--deliver=innocent,spam -d %u"
ServerIdent "dspam-in"
ProcessorBias on
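
A config this long is easy to get repeated lines into. As a rough aid, here is a minimal sketch of a checker that reads `Directive value` lines and reports any directive/value pair that appears more than once. It assumes one directive per line, whitespace-separated, with `#` starting a comment and never appearing inside a value; `parse_conf` and `duplicates` are hypothetical helper names, not part of dspam.

```python
from collections import Counter


def parse_conf(text):
    """Parse 'Directive value...' lines into (directive, value) pairs.

    Assumes '#' always starts a comment (no '#' inside values).
    """
    pairs = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comment
        if not line:
            continue
        parts = line.split(None, 1)  # split on the first run of whitespace
        directive = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        pairs.append((directive, value))
    return pairs


def duplicates(pairs):
    """Return the (directive, value) pairs that occur more than once."""
    return [pair for pair, n in Counter(pairs).items() if n > 1]


# A few lines taken from the config above
sample = """\
AllowOverride trainingMode
AllowOverride spamAction spamSubject
PurgeSignatures 21 # Stale signatures
AllowOverride trainingMode
"""

pairs = parse_conf(sample)
print(duplicates(pairs))
```

Run against the full file, this kind of check would flag the second `AllowOverride trainingMode` line, which is a harmless repeat.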
Best,
--Tonni
( Holy cow! I have never had accuracy that high. :( )