Re: Will bayes-db be 'skewed' by feeding it spam only (one central database)

Chr. v. Stuckrad Tue, 18 Jul 2006 04:52:25 -0700

On Mon, 17 Jul 2006, Logan Shaw wrote:

...
> someone carrying a knife, they have been a violent criminal,
> so knife-carrying correlates perfectly with being a criminal.
> 
> Now imagine that you see a chef.  He is carrying a knife, but
(Good point. [OT: I even know people who react that way to TV news] :-)


...
> by doing that, you will give it a very negative view of the
> world, where everything looks like spam.
> 
> (This is all assuming, of course, that your Bayes database is
> empty when you train it with spam only.)

Assuming this scenario: I ORIGINALLY started the database
using a long backlog of MY mail, which at that time had enough
spam AND ham to start with, so it's not as bad as it could be;
but since the last 'fresh start' I have 'updated' it only with false negatives.
Checking nearly 6000 (low-scoring) spams a week, I found only
'classical false positives' (like mail from this list :-) and for months
*I* did not lose (sort away) anything important. But maybe
once in two months one of our power users complains about a real
false positive, and if I'm allowed, I feed THAT one in.
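
To make the 'skew' concrete, here is a toy sketch in Python. It uses a
Robinson-style smoothed token probability, which is only loosely related to
what SpamAssassin really does; the function names, the sample messages and
the parameters s and x are assumptions made up for this illustration.

```python
from collections import Counter

def train(messages):
    """Count, per token, in how many messages it appears."""
    counts = Counter()
    for msg in messages:
        counts.update(set(msg.lower().split()))
    return counts

def spam_prob(token, spam_counts, ham_counts, n_spam, n_ham, s=1.0, x=0.5):
    """Smoothed estimate that a message containing `token` is spam.

    s is the weight of the prior and x the prior itself (both assumed).
    """
    b = spam_counts[token]          # spam messages containing the token
    g = ham_counts[token]           # ham messages containing the token
    n = b + g
    spam_ratio = b / n_spam if n_spam else 0.0
    ham_ratio = g / n_ham if n_ham else 0.0
    if n == 0 or spam_ratio + ham_ratio == 0:
        return x                    # token never seen: fall back to the prior
    p = spam_ratio / (spam_ratio + ham_ratio)
    return (s * x + n * p) / (s + n)

# Made-up sample data, not real mail.
spam = ["buy cheap pills for the weekend", "the best cheap watches"]
ham = ["the meeting is moved to friday", "please restart the mail server"]

spam_counts = train(spam)
# Database trained with spam only: even the word "the" looks spammy.
skewed = spam_prob("the", spam_counts, Counter(), len(spam), 0)
# Database trained with both: "the" becomes neutral.
balanced = spam_prob("the", spam_counts, train(ham), len(spam), len(ham))
print(round(skewed, 3), round(balanced, 3))
```

With no ham in the database, a harmless word like "the" leans clearly
towards spam; once ham is trained too, it drops back to neutral.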

> configuration changes that need to be made.  Do you have the
> latest SpamAssassin, and have you enabled some network tests
Not the latest, because Debian 'stable' is slow to pick up
new versions.  Maybe I should move to the
volatile packages ...
> like DCC or razor and some RBLs?  Those should be carrying
> some of the load; you shouldn't be relying on Bayes only,

Of course. Razor, Pyzor, DCC, the newer German iX plugin,
and RBLs do catch lots of mail, pushing thousands of messages to scores
above 20 :-)

> If your Bayes database really is messed up, personally I would
...
> you *do* have is worthwhile.

Hmmmm.... maybe on one of the next 'maintenance days',
when (nearly) everything is down for a while, so nothing
will slip through during training ...

But this keeps me thinking about the different 'hams' in
our department. Some are French and some might even be Chinese.
So if I train again with *my* mail (postmaster problems and
a bit of half-private stuff), the database might again start out
skewed 'against' the real ham of other parts of the department!
(While I think 'my spam' will be fine to train with.)

The only 'real solution' might be to switch to an SQL database
and 'bayes-per-user', but then I'd have to teach hundreds
of students how to 'train' their own databases :-))
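
For reference, a minimal sketch of what the SQL switch could look like in
local.cf. The option names are SpamAssassin's own; the database name, user
and password are placeholders, and real per-user data would, as far as I
understand, also need SpamAssassin to be called with each recipient's
identity, which a central filter host does not do by itself.

```
# local.cf fragment (sketch): keep Bayes data in SQL instead of DBM files
bayes_store_module  Mail::SpamAssassin::BayesStore::SQL
bayes_sql_dsn       DBI:mysql:sa_bayes:localhost    # placeholder database/host
bayes_sql_username  sa_user                         # placeholder
bayes_sql_password  change-me                       # placeholder
# Uncomment to force one shared Bayes user instead of per-user data:
# bayes_sql_override_username  shared
```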

...
> Well, there are probably several different explanations.
> The best place to start is by looking at the spams that get
> through and how they scored, especially comparing that to what
> scores others get on the same messages or similar ones.

That's one of the problems here. The mail filter host runs an old
amavis-perl and does not include the full scoring headers in the mail,
only a marking header with the score itself.  So when I later check
the same mail (cleaned of the previous marking) I get completely
different (mostly horrendously higher) scores for the same message, without
really seeing where the differences come from.  Seemingly, the later a spam
from a series comes in, the more of the dynamic systems have learned it
and score it.  I almost believe we are often 'at one end' of some
'lists to be spammed', so we get it 'fresh' and only the first users
are hit; others get it 'after' the filter has dynamically clamped down on it,
and so different users complain about different 'slips'. Sometimes
it *seems* as if spammers work through their list alphabetically, so user "a*"
often gets something that "w*" never sees, and the other way around
too :-)
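
If the filter host could be made to record the full status line (which the
setup described here does not do today), the comparison between the first
scan and a later re-scan could be sketched like this in Python. It assumes
SpamAssassin's usual X-Spam-Status layout with "score=" (or the older
"hits=") and "tests="; the helper names and the sample scores are made up.

```python
import re
from email import message_from_string

def parse_status(raw_message):
    """Return (score, set of rule names) from an X-Spam-Status header."""
    msg = message_from_string(raw_message)
    status = " ".join(msg.get("X-Spam-Status", "").split())  # unfold header
    status = re.sub(r",\s+", ",", status)   # rejoin a folded tests= list
    score_match = re.search(r"\b(?:score|hits)=(-?\d+(?:\.\d+)?)", status)
    tests_match = re.search(r"\btests=(\S+)", status)
    score = float(score_match.group(1)) if score_match else None
    tests = set()
    if tests_match and tests_match.group(1) != "none":
        tests = {t for t in tests_match.group(1).split(",") if t}
    return score, tests

def new_rules(early_raw, late_raw):
    """Rules that fired only on the later scan of the same message."""
    _, early_tests = parse_status(early_raw)
    _, late_tests = parse_status(late_raw)
    return late_tests - early_tests

# Made-up sample headers for one message, scanned at two points in time.
early = ("X-Spam-Status: No, score=3.1 required=5.0 "
         "tests=BAYES_50,HTML_MESSAGE\n\nbody\n")
late = ("X-Spam-Status: Yes, score=21.4 required=5.0 "
        "tests=BAYES_50,HTML_MESSAGE,\n\tRAZOR2_CHECK,URIBL_BLACK\n\nbody\n")

print(sorted(new_rules(early, late)))
```

The rules that show up only in the later scan are the candidates for the
'dynamic' tests that had not yet learned the spam when it first arrived.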

Thanks, Stucki

-- 
Christoph von Stuckrad      * * |nickname |<[EMAIL PROTECTED]>   \
Freie Universitaet Berlin   |/_*|'stucki' |Tel(days):+49 30 838-5 57 78|
Mathematik & Informatik EDV |\ *|if online|Tel(else):+49 30 77 39 66 00|
Arnimallee 6 / 14195 Berlin * * |on IRCnet|Fax(alle):+49 30 838-75 454/
