On Tue, 14 Oct 2014 23:54:56 +0200
Axb wrote:

> On 10/14/2014 05:07 PM, RW wrote:
> > On Tue, 14 Oct 2014 13:58:27 +0200
> > Axb wrote:
> >
> >> On 10/14/2014 01:51 PM, RW wrote:
> >>> On Tue, 14 Oct 2014 10:44:51 +0200
> >>> Axb wrote:
> >>>
> >>>>
> >>>> have you verified that some of these are not included?
> >>>>
> >>>> X-Originating-IP will not be included as it can be used to help
> >>>> detect ham or spam
> >>>
> >>> It's really no different to other headers you are ignoring.
> >>
> >> for example, if you get a flood of 419s from the same source, you
> >> may want it to be tokenized...
> >
> >
> > As I do with, for example:
> >
> >    X-AntiAbuse: Originator/Caller UID/GID - [514 32007] / [47 12]
> >
> > in this spam Bayes found
> >
> >    0.999-4--HX-AntiAbuse:32007
> >
> > These numbers seem to be very good indicators for me.
> >
> >
> > Most of the headers in the file have never appeared in my ham, so
> > they'll be pure spam indicators if they are ever faked. In general
> > it's difficult for a spammer to gain an overall advantage against
> > an average per user database using faked headers.
> >
> > Whatever the merits of this on system-wide Bayes (if any beyond
> > reducing token count), I think it would have a negative effect on
> > per user Bayes.
> >
> 
> oooooooooooook..
> now here's a surprise (it's all in the code :)

It wasn't a surprise to me. Many of them I agree with, some I don't. On
the whole I don't care enough to patch it. 

I'm not against ignoring things that obviously, or empirically, don't
help. What I didn't want was a huge list being imposed on everyone,
which was your original plan.

I certainly would patch it if X-Delivered-To were included;
Delivered-To and (?:X-)?Envelope-To definitely shouldn't be there IMO.
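
To make the matching concrete, here is a minimal sketch of how an
alternation-style ignore list like the fragments quoted in this thread
would treat those header names. It is written in Python for
illustration only and is not SpamAssassin's actual code; the name
IGNORED_HEADERS and the helper is_ignored are hypothetical, and the
list contains only the entries mentioned in this message.

```python
import re

# Hypothetical ignore list built from the fragments quoted in this thread.
# This is an illustration, not the list shipped with SpamAssassin.
IGNORED_HEADERS = re.compile(
    r"(?:Delivered-To|(?:X-)?Envelope-To|Subject)",
    re.IGNORECASE,
)


def is_ignored(header_name: str) -> bool:
    """Return True if the whole header name matches the ignore list."""
    return IGNORED_HEADERS.fullmatch(header_name) is not None


if __name__ == "__main__":
    for name in ("Delivered-To", "Envelope-To", "X-Envelope-To",
                 "X-Delivered-To", "X-AntiAbuse"):
        print(f"{name}: {'ignored' if is_ignored(name) else 'tokenized'}")
```

Because the whole name has to match, X-Delivered-To is not caught by the
Delivered-To entry in this sketch; it would need its own entry.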

>    |Subject      # not worth a tiny gain vs. to db size increase

I'd forgotten about that one. The subject is already tokenized
through the body, and the exclusion probably made a lot of sense when
spammers weren't taking statistical filters seriously. But word
frequencies can be different in the subject, and spammers are now very
good at denying Bayes useful tokens. I think it's unfortunate that the
exclusion is unconditional.

The trouble is that a lot of this is a judgement about cost/benefit.
For me, Bayes uses 20 millipennies of storage and catches 72% of spam
at BAYES_9*, while Bogofilter uses 200 millipennies but catches 94% of
spam. To me that's 180 millipennies well spent, and I wouldn't begrudge
Bayes a similar amount - I might even go to whole pennies.
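
The arithmetic behind that comparison can be sketched as follows. The
four input figures are the ones quoted above; the variable names are
made up for illustration, and treating the two catch rates as directly
comparable is an assumption.

```python
# Figures quoted in the message; costs are in the author's "millipennies".
bayes_cost, bayes_catch_pct = 20, 72    # SpamAssassin Bayes, hits at BAYES_9*
bogo_cost, bogo_catch_pct = 200, 94     # Bogofilter

# Extra storage cost paid for Bogofilter's larger database.
extra_cost = bogo_cost - bayes_cost

# Extra spam caught, in percentage points.
extra_points = bogo_catch_pct - bayes_catch_pct

# Spam missed by each filter, in percent of all spam.
bayes_missed = 100 - bayes_catch_pct
bogo_missed = 100 - bogo_catch_pct

if __name__ == "__main__":
    print(f"extra cost: {extra_cost} millipennies")
    print(f"extra catch: {extra_points} percentage points")
    print(f"missed spam: {bayes_missed}% vs {bogo_missed}%")
```

On these numbers the extra 180 millipennies buys 22 percentage points
of catch rate, with missed spam falling from 28% to 6%.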
