On Tue, 14 Oct 2014 23:54:56 +0200
Axb wrote:

> On 10/14/2014 05:07 PM, RW wrote:
> > On Tue, 14 Oct 2014 13:58:27 +0200
> > Axb wrote:
> >
> >> On 10/14/2014 01:51 PM, RW wrote:
> >>> On Tue, 14 Oct 2014 10:44:51 +0200
> >>> Axb wrote:
> >>>
> >>>> have you verified that some of these are not included?
> >>>>
> >>>> X-Originating-IP will not be included as it can be used to help
> >>>> detect ham or spam
> >>>
> >>> It's really no different to other headers you are ignoring.
> >>
> >> for example, if you get a flood of 419s from the same source, you
> >> may want it to be tokenized...
> >
> > As I do with, for example:
> >
> > X-AntiAbuse: Originator/Caller UID/GID - [514 32007] / [47 12]
> >
> > in this spam Bayes found
> >
> > 0.999-4--HX-AntiAbuse:32007
> >
> > These numbers seem to be very good indicators for me.
> >
> > Most of the headers in the file have never appeared in my ham, so
> > they'll be pure spam indicators if they are ever faked. In general
> > it's difficult for a spammer to gain an overall advantage against
> > an average per user database using faked headers.
> >
> > Whatever the merits of this on system-wide Bayes (if any beyond
> > reducing token count), I think it would have a negative effect on
> > per user Bayes.
>
> oooooooooooook..
> now here's a suprise (it's all in the code :)

It wasn't a surprise to me. Many of them I agree with, some I don't. On
the whole I don't care enough to patch it. I'm not against ignoring
things that obviously, or empirically, don't help; what I didn't want
was a huge list being imposed on everyone, which was your original plan.

I certainly would patch it if X-Delivered-To were included; Delivered-To
and (?:X-)?Envelope-To definitely shouldn't be there IMO.

> |Subject # not worth a tiny gain vs. to db size increase

I'd forgotten about that one. The subject is already tokenized through
the body, and the exclusion probably made a lot of sense when spammers
weren't taking statistical filters seriously. But word frequencies can
be different in the subject, and spammers are now very good at denying
Bayes useful tokens. I think it's unfortunate that that exclusion is
unconditional.

The trouble is that a lot of this comes down to a judgement about
cost/benefit. For me, Bayes uses 20 millipennies of storage and catches
72% of spam at BAYES_9*, while Bogofilter uses 200 millipennies but
catches 94% of spam. To me that's 180 millipennies well spent and I
wouldn't begrudge Bayes a similar amount - I might even go to whole
pennies.
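
For anyone following along who hasn't looked at the code, here is a rough sketch of the kind of header exclusion being argued about. It is Python rather than SpamAssassin's actual Perl, and the ignore pattern, the function name and the word-splitting rule are all my own assumptions for illustration. Only the "H<header>:<token>" shape is taken from the HX-AntiAbuse:32007 token quoted above.

```python
import re

# Hypothetical ignore list for illustration only; NOT SpamAssassin's real list.
IGNORED_HEADERS = re.compile(r"^(?:Subject|X-Originating-IP)$", re.IGNORECASE)


def tokenize_headers(headers, ignored=IGNORED_HEADERS):
    """Return 'H<name>:<word>' tokens for every header not on the ignore list.

    `headers` is a list of (name, value) pairs. Splitting the value on runs of
    letters and digits is an assumed simplification, not the real tokenizer.
    """
    tokens = []
    for name, value in headers:
        if ignored.match(name):
            continue  # excluded unconditionally: contributes no tokens at all
        for word in re.findall(r"[A-Za-z0-9]+", value):
            tokens.append(f"H{name}:{word}")
    return tokens


headers = [
    ("X-AntiAbuse", "Originator/Caller UID/GID - [514 32007] / [47 12]"),
    ("Subject", "Urgent business proposal"),
]
tokens = tokenize_headers(headers)
```

With that ignore list the UID/GID numbers still reach Bayes as header tokens, while anything distinctive about the Subject as a header is lost, which is the trade-off in question.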
