Re: Will bayes-db be 'skewed' by feeding it spam only (one central database)

Dirk Bonengel Tue, 18 Jul 2006 08:14:24 -0700

Stucki,

Did you investigate auto-learning? This might let your system learn ham as well as spam. Works fine here (same situation: gateway server to a Lotus Notes system, no feedback loop possible).
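
To illustrate the idea behind auto-learning, here is a rough Python sketch of a threshold-based decision. It is not SpamAssassin's actual code. The two constants mirror what I believe are the defaults of bayes_auto_learn_threshold_nonspam and bayes_auto_learn_threshold_spam; the function name is made up for illustration, and the real implementation applies further conditions that are left out here.

```python
# Simplified sketch of threshold-based auto-learning (not SpamAssassin's code).
# Assumed defaults: nonspam threshold 0.1, spam threshold 12.0.
HAM_THRESHOLD = 0.1
SPAM_THRESHOLD = 12.0


def auto_learn_decision(score, ham_threshold=HAM_THRESHOLD,
                        spam_threshold=SPAM_THRESHOLD):
    """Return 'ham', 'spam', or None (do not learn) for a message score."""
    if score < ham_threshold:
        return "ham"    # very low score: safe to learn as ham
    if score > spam_threshold:
        return "spam"   # very high score: safe to learn as spam
    return None         # ambiguous middle range: learn nothing


if __name__ == "__main__":
    for s in (-1.0, 5.0, 20.0):
        print(s, auto_learn_decision(s))
```

The point is that only messages the other rules are already very sure about get learned, so the database receives ham as well as spam without any user feedback.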

As far as I recall, SA starts using its Bayes data only after having learned at least 200 ham and 200 spam. I guess this applies to per-user databases as well, which in turn means many users will never (or only late) accumulate enough data to use Bayes effectively. I'd stick to a global DB....
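
A small Python sketch of why the minimum counts favour a global DB. The limits of 200 correspond to what I recall as the defaults of bayes_min_ham_num and bayes_min_spam_num; the user names and message counts are invented purely for illustration.

```python
# Sketch: Bayes is only consulted once enough ham AND spam have been learned.
# Assumed minimums of 200 each; all per-user counts below are hypothetical.
MIN_HAM = 200
MIN_SPAM = 200


def bayes_active(nham, nspam, min_ham=MIN_HAM, min_spam=MIN_SPAM):
    """True if the database has learned enough of both classes."""
    return nham >= min_ham and nspam >= min_spam


# (ham learned, spam learned) per user -- made-up numbers
per_user = {"alice": (40, 900), "bob": (15, 300), "carol": (260, 1200)}

active_users = sorted(u for u, (h, s) in per_user.items() if bayes_active(h, s))
global_ham = sum(h for h, _ in per_user.values())
global_spam = sum(s for _, s in per_user.values())
global_active = bayes_active(global_ham, global_spam)

if __name__ == "__main__":
    print("per-user DBs in use:", active_users)
    print("global DB in use:", global_active)
```

With these made-up numbers only one of three per-user databases would be used, while the pooled database clears both minimums.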

If I was in your position, I'd try to switch over to a system like Maia Mailguard that keeps a copy of each mail in a database and lets users confirm and/or correct the underlying SpamAssassin engine's decisions. This system uses a single Bayes DB.... Works fine at a customer of ours that uses some weird proprietary document management software.


Hope my plugin works well....feedback off-list would be welcome

Dirk

Chr. v. Stuckrad wrote:

On Mon, 17 Jul 2006, Logan Shaw wrote:

...

someone carrying a knife, they have been a violent criminal,
so knife-carrying correlates perfectly with being a criminal.

Now imagine that you see a chef.  He is carrying a knife, but

(Good point: [OT: I even know people who react that way on TV-News] :-)

...

by doing that, you will give it a very negative view of the
world, where everything looks like spam.

(This is all assuming, of course, that your Bayes database is
empty when you train it with spam only.)
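
The knife analogy can be sketched with a toy per-token estimate in Python. This is a deliberately naive frequency ratio, not the formula SpamAssassin uses, and the sample messages are invented. It only shows that with no ham learned, every token seen in spam looks fully spammy.

```python
from collections import Counter


def train(messages):
    """Count, per token, how many messages contain it."""
    counts = Counter()
    for m in messages:
        counts.update(set(m.lower().split()))
    return counts, len(messages)


def token_spam_prob(token, spam_counts, nspam, ham_counts, nham):
    """Naive estimate: spam frequency / (spam frequency + ham frequency)."""
    s = spam_counts[token] / nspam if nspam else 0.0
    h = ham_counts[token] / nham if nham else 0.0
    if s + h == 0:
        return 0.5  # token never seen: stay neutral
    return s / (s + h)


# Invented sample mail
spam = ["buy cheap pills now", "cheap meeting software now"]
ham = ["meeting agenda for monday", "software meeting notes"]

spam_counts, nspam = train(spam)
empty_counts, nempty = train([])   # database trained on spam only
ham_counts, nham = train(ham)

spam_only = token_spam_prob("meeting", spam_counts, nspam, empty_counts, nempty)
balanced = token_spam_prob("meeting", spam_counts, nspam, ham_counts, nham)

if __name__ == "__main__":
    print("spam-only training:", spam_only)
    print("spam and ham training:", balanced)
```

Trained on spam only, an ordinary word like "meeting" gets the maximum spam estimate; once ham containing it is learned too, the estimate drops below neutral.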


Assuming this scenario, I ORIGINALLY started the database
on ham from a long backlog of MY mail, which THEN had enough
spam AND ham to start with, so it's not as bad as it could be;
but since the last 'fresh start' I 'updated' it with false negatives only.
And checking nearly 6000 (low scoring) spams a week I found only
'classical false positives' (like mail from this list :-) and for months
*I* did not lose (sort away) anything important. But maybe once
every two months one of our power-users complains about a real
false positive, and if I'm allowed, I feed THAT one in.

configuration changes that need to be made.  Do you have the
latest SpamAssassin, and have you enabled some network tests

Not the latest, because Debian 'stable' is not fast in
the uptake of new versions.  Maybe I should move to the
volatile packages ...

like DCC or razor and some RBLs?  Those should be carrying
some of the load; you shouldn't be relying on Bayes only,


Of course. razor, pyzor, dcc, and the newer German iX-plugin,
and RBLs do catch lots of mails, pushing thousands to scores
above 20 :-)

If your Bayes database really is messed up, personally I would

...

you *do* have is worthwhile.


Hmmmm.... maybe on one of the next 'maintenance days',
when (nearly) everything is down for a while, so nothing
will slip through during training ...

But this 'keeps' me thinking about the different 'hams' in
our department. Some are French and some might even be Chinese.
So if I train again with *my* mail (postmaster-problems and
a bit of half-private stuff) the database might start anew
skewed 'against' real hams of other parts of the department!
(While I think 'my spam' will be fine to train with).

The only 'real solution' might be to switch to a SQL-Database
and 'bayes-per-user', but then I'd have to 'train' hundreds
of Students how to 'train' their own databases themselves :-))

...

Well, there are probably several different explanations.
The best place to start is by looking at the spams that get
through and how they scored, especially comparing that to what
scores others get on the same messages or similar ones.


That's one of the problems here. The mail-filter(-host) runs on old
amavis-perl and does not include the whole scoring headers in the mail,
but only a marking header with the score itself.  So when I later check
the same mail (cleaned of the previous marking) I get completely
different (mostly horrendously higher) scores for the same, but without
really seeing the differences.  Seemingly, the later a 'one of a
series' spam comes in, the more of the dynamic systems have learned it
and score it.  I nearly believe we often are 'at one end' of some
'lists to be spammed', so we get it 'fresh', and only the first users
are hit, others get it 'after' the filter dynamically chokes down on it
and so the different users do complain about different 'slips'. Sometimes
it *seems* as if spammers work through their list alphabetically, so user "a*"
often gets something which "w*" never sees, and the other way around
too :-)

Thanks Stucki
