On Tue, 18 Jul 2006, Chr. v. Stuckrad wrote:
I'm a postmaster who has been working with SpamAssassin (now on
Debian sarge) for the last few years. We have one filter host for
all mail, so at the moment we have only one global Bayes database.
We are a department for math and computer science and so we get zillions
of spam for all addresses 'known on the net' and we get ham for lots of
different 'themes' for different workgroups in diverse languages (mostly
German of course, this being Berlin, Germany).
Not being allowed to peek into other users' mailboxes, I have no
'representative ham corpus' but only my own, which seems to be
very postmaster-specific, while I seem to get a typical average
of spams (because my address already existed on a 'News' server :-).
Can somebody tell me whether the Bayes database's accuracy
deteriorates if I feed it 'only my spam' (my false negatives) and
do not feed it the (to me unknown) typical hams?
Yes, feeding your Bayes database only spam is a bad idea.
As an analogy, imagine that you are a policeman trying to
learn to identify dangerous and violent people. You examine
100 violent criminals, and all of them are carrying knives.
You don't examine anyone else, though, so based on your
sample, anyone carrying a knife must be a violent criminal.
The reasoning for this is simple: every time you have seen
someone carrying a knife, they have been a violent criminal,
so knife-carrying correlates perfectly with being a criminal.
Now imagine that you see a chef. He is carrying a knife, but
what does your experience tell you about him? You have never
seen anyone carrying a knife who wasn't a criminal,
so this new guy must be a criminal too. But he's not: he's
just a chef.
This problem only arises with words (tokens) that could be
expected to appear in both spam and ham. It isn't a problem
for words that are names of "performance-enhancing" drugs.
But it is a problem for neutral words. For example, a word
like "link" or "today" might occur in both ham and spam, so
it doesn't indicate much about which type of message it is.
But if you train your Bayes database only with spam, it will
see neutral words as strongly associated with spam. Basically,
by doing that, you will give it a very negative view of the
world, where everything looks like spam.
(This is all assuming, of course, that your Bayes database is
empty when you train it with spam only.)
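To make the effect concrete, here is a minimal sketch in Python of a
simplified per-token estimate. It is not SpamAssassin's actual Bayes
code, which combines token probabilities in a more elaborate way. The
function names and the sample messages are made up for illustration.

```python
from collections import Counter


def train(messages):
    """Count in how many messages each token appears."""
    counts = Counter()
    for msg in messages:
        # Use a set so a token counts once per message.
        counts.update(set(msg.lower().split()))
    return counts


def spam_probability(token, spam_counts, ham_counts, n_spam, n_ham):
    """Simplified estimate of P(spam | token) from per-class frequencies."""
    spam_freq = spam_counts[token] / n_spam if n_spam else 0.0
    ham_freq = ham_counts[token] / n_ham if n_ham else 0.0
    if spam_freq + ham_freq == 0:
        return 0.5  # unseen token: no evidence either way
    return spam_freq / (spam_freq + ham_freq)


# Hypothetical toy corpora.
spam = ["buy pills today", "click this link today", "cheap pills link"]
ham = [
    "meeting today at noon",
    "here is the link to the paper",
    "lecture notes for today",
]

spam_counts = train(spam)
ham_counts = train(ham)
no_ham = Counter()  # a database that has never been shown any ham

# Neutral word, trained on spam only: looks like a perfect spam sign.
spam_only = spam_probability("today", spam_counts, no_ham, len(spam), 0)

# Same word, trained on both classes: carries no information.
balanced = spam_probability("today", spam_counts, ham_counts, len(spam), len(ham))

# A genuinely spammy word stays spammy even with ham in the database.
pills_balanced = spam_probability("pills", spam_counts, ham_counts, len(spam), len(ham))

print("today, spam-only training:", spam_only)
print("today, balanced training: ", balanced)
print("pills, balanced training: ", pills_balanced)
```

With spam-only training, the neutral word "today" is indistinguishable
from "pills". Once ham is added, "today" drops back to neutral while
"pills" stays strongly spammy.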
To me it lately seems to slowly skew to let more and more spam
through, instead of 'catching' it. Is this typical? Do I have
to recreate the database? Or do I need to get 'ham from a set
of typical users' to balance the database? OR are there typical
values for bayes_auto_learn_threshold_{non,}spam, different from
the default, to use in my case?
To answer that question, we'd first have to know whether
Bayes is really at fault here. Perhaps there are other
configuration changes that need to be made. Do you have the
latest SpamAssassin, and have you enabled some network tests
like DCC or Razor and some RBLs? Those should be carrying
some of the load; you shouldn't be relying on Bayes only,
because these days Bayes alone isn't sufficient.
If your Bayes database really is messed up, personally I would
recommend that you just wipe it and start over. If you have
the proper setup, then you can be confident it will be trained
correctly. Yes, you would be throwing away existing data,
but what you get in exchange is the knowledge that the data
you *do* have is worthwhile.
Just curious why so many spams get through to me ...
(i.e. around 10 false negatives relative to 90 marked as spam,
which is 'relatively bad' compared to many opinions on the list)
Well, there are probably several different explanations.
The best place to start is by looking at the spams that get
through and how they scored, especially comparing that to what
scores others get on the same messages or similar ones.
- Logan