On 18-Jul-06, at 11:14 AM, Logan Shaw <[EMAIL PROTECTED]> wrote:

On Tue, 18 Jul 2006, Chr. v. Stuckrad wrote:

I'm a postmaster who has been working with SpamAssassin (now on Debian sarge)

for the last few years. We have one filter host for all mail,

so at the moment we have only one global Bayes database.


We are a department for math and computer science and so we get zillions

of spam for all addresses 'known on the net' and we get ham for lots of

different 'themes' for different workgroups in diverse languages (mostly

german of course, being Berlin Germany).

Not being allowed to peek into other users' mailboxes, I have no

'representative ham corpus' but only my own, which seems to be

very postmaster-specific, while I seem to get a typical average

of spams (because my address already existed on a 'News' server :-).


Can somebody tell me whether the Bayes database's accuracy

deteriorates when I feed it 'only my spam' (my false negatives) and

do not feed it the (to me unknown) typical hams?


Yes, feeding your Bayes database only spam is a bad idea.


As an analogy, imagine that you are a policeman trying to

learn to identify dangerous and violent people.  You examine

100 violent criminals, and all of them are carrying knives.

You don't examine anyone else, though, so based on your

sample, anyone carrying a knife must be a violent criminal.

The reasoning for this is simple:  every time you have seen

someone carrying a knife, they have been a violent criminal,

so knife-carrying correlates perfectly with being a criminal.


Now imagine that you see a chef.  He is carrying a knife, but

what does your experience tell you about him?  You have never

seen anyone *else* carrying a knife who wasn't a criminal,

so this new guy must be a criminal too.  But he's not:  he's

just a chef.


This problem only arises with words (tokens) that could be

expected to appear in both spam and ham.  It isn't a problem

for words that are names of "performance-enhancing" drugs.

But it is a problem for neutral words.  For example, a word

like "link" or "today" might occur in both ham and spam, so

it doesn't indicate much about which type of message it is.

But if you train your Bayes database only with spam, it will

see neutral words as strongly associated with spam.  Basically,

by doing that, you will give it a very negative view of the

world, where everything looks like spam.
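
To make the effect concrete, here is a toy sketch in Python. It is not
SpamAssassin's actual Bayes implementation; the scoring formula, the
sample messages, and the names train() and spam_probability() are all
made up for illustration. It only shows how a neutral word such as
"today" looks spammy when no ham has been learned, and becomes neutral
once comparable ham is added.

```python
from collections import Counter

def train(messages):
    """Count, for each token, how many messages contain it."""
    counts = Counter()
    for msg in messages:
        counts.update(set(msg.lower().split()))
    return counts

def spam_probability(token, spam_counts, ham_counts):
    """Toy estimate of P(spam | token) with light smoothing.

    Illustrative only: assumes spam and ham corpora of similar size.
    A Counter returns 0 for tokens it has never seen.
    """
    s = spam_counts[token]
    h = ham_counts[token]
    return (s + 0.5) / (s + h + 1.0)

# Made-up training messages.
spam = ["buy pills today", "click this link today", "cheap pills link"]
ham = ["meeting today at noon", "see the link to the paper", "seminar today"]

spam_counts = train(spam)
ham_counts = train(ham)
no_ham = Counter()  # a database that was never shown any ham

spam_only = spam_probability("today", spam_counts, no_ham)
balanced = spam_probability("today", spam_counts, ham_counts)
pills = spam_probability("pills", spam_counts, ham_counts)

print(f"'today', spam-only training: {spam_only:.2f}")
print(f"'today', spam + ham training: {balanced:.2f}")
print(f"'pills', spam + ham training: {pills:.2f}")
```

With spam-only training, "today" scores as high as "pills" does; with
ham in the database, "today" drops back to neutral while "pills" stays
high.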


(This is all assuming, of course, that your Bayes database is

empty when you train it with spam only.)


Lately it seems to me to skew slowly toward letting more and more spam

through, instead of 'catching' it.  Is this typical?  Do I have

to recreate the database? Or do I need to get 'ham from a set

of typical users' to balance the database? OR are there typical

values for bayes_auto_learn_threshold_{non,}spam, different from

the default, to use in my case?


To answer that question, we'd first have to know whether

Bayes is really at fault here.  Perhaps there are other

configuration changes that need to be made.  Do you have the

latest SpamAssassin, and have you enabled some network tests

like dcc or razor and some RBLs?  Those should be carrying

some of the load; you shouldn't be relying on Bayes only,

because these days Bayes alone isn't sufficient.


If your Bayes database really is messed up, personally I would

recommend that you just wipe it and start over.  If you have

the proper setup, then you can be confident it will be trained

correctly.  Yes, you would be throwing away existing data,

but what you get in exchange is the knowledge that the data

you *do* have is worthwhile.


Just curious why so many spams get through to me ...

(i.e. around 10 false negatives relative to 90 marked as spam,

which is 'relatively bad' compared to many opinions on the list)


Well, there are probably several different explanations.

The best place to start is by looking at the spams that get

through and how they scored, especially comparing that to what

scores others get on the same messages or similar ones.


  - Logan


Great analogy, Logan, and reading it only reinforces my belief that Stucki's problem may not be due to a DB skewed by too much spam. Actually, the opposite result would probably be true: if the DB were skewed with too much spam, the result would normally be too many false positives, because the DB would associate too many 'neutral' words with spam.

Stucki, maybe SpamAssassin is working better than you think, and the answer to your false negatives is to lower the score at which a message is considered spam. Have you examined the scores assigned to your ham messages?

Assuming your spam score level is set at 7 and all your ham is scoring below 4, maybe you should lower the threshold to 5.
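
One way to check this before changing anything is to collect the scores of messages you have already sorted by hand and count the errors at each candidate threshold. The sketch below is only an illustration: the score lists are invented, error_counts() is a hypothetical helper, and it assumes a message is marked as spam when its score is greater than or equal to the threshold (the required_score setting in SpamAssassin 3.x).

```python
def error_counts(ham_scores, spam_scores, threshold):
    """Return (false_positives, false_negatives) at a given threshold.

    Assumes a message is tagged as spam when score >= threshold.
    """
    false_positives = sum(1 for s in ham_scores if s >= threshold)
    false_negatives = sum(1 for s in spam_scores if s < threshold)
    return false_positives, false_negatives

# Invented scores of hand-sorted messages, for illustration only.
ham_scores = [-1.2, 0.3, 1.8, 2.5, 3.9]
spam_scores = [4.6, 5.2, 6.1, 6.8, 9.4, 12.0, 15.3]

results = {t: error_counts(ham_scores, spam_scores, t) for t in (7.0, 5.0)}

for threshold, (fp, fn) in results.items():
    print(f"threshold {threshold}: {fp} false positives, {fn} false negatives")
```

With these made-up numbers, moving from 7 to 5 catches three more spams without tagging any ham. Real score distributions will overlap more, so it is worth looking at your own before choosing a value.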

Just something to consider.

--
Gino Cerullo

Pixel Point Studios
21 Chesham Drive
Toronto, ON  M3M 1W6

T: 416-247-7740
F: 416-247-7503

